Riemann to InfluxDB event drop

I have a simple setup which uses filebeat and topbeat to forward data to Logstash, which further forwards it to Riemann, which in turn sends it to InfluxDB 0.9. I use Logstash to split an event into multiple events, all of which show up in the Riemann logs (with the same timestamp). However, only one of these split events reaches InfluxDB. Any help please?

In InfluxDB 0.9, a point is uniquely identified by the measurement name, full tag set, and the timestamp. If another point arrives later with identical measurement name, tag set, and timestamp, it will silently overwrite the previous point. This is intentional behavior.
Since your timestamps are identical and you're writing to the same measurement, you must ensure that the tag set differs for each point you want to record. Even an artificial tag like fuzz=1, fuzz=2, fuzz=3, ... is enough to differentiate the points.
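For example, in line-protocol terms (the measurement, tag, and field names here are invented), these two writes share a timestamp but survive as separate points because the fuzz tag differs:

    net_events,host=web01,fuzz=1 value=42 1434055562000000000
    net_events,host=web01,fuzz=2 value=17 1434055562000000000

Without the fuzz tag, the second write would simply overwrite the first.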

Related

Logs and metric values don't match

I'm trying to configure an alert for a log based metric in Google Cloud Monitoring. In my sample time frame, there are two log entries I'm interested in.
Using the metrics explorer, I build a query for the metric, but the values in the metrics explorer don't make sense. For the first entry the metrics explorer shows a value of 4, and for the second log entry there are two bars, one with a value of 1 and one with a value of 2.
It doesn't make any sense! Does anyone know how to properly configure this?
(Screenshots of the metrics explorer view and the underlying log data were attached to the original question.)
Note that my local time is UTC+3, hence the timestamp offset.

How to check properties before update in Elasticsearch?

I've already read the official documentation and found no way to do this.
My data comes to ES from Kafka and can sometimes arrive out of order. In the past, each message from Kafka was parsed and used to directly insert or update an ES doc with a specific ID. To avoid older data overriding newer data, I have to check whether the doc with that ID already exists and whether some of its properties meet certain conditions; only then do I perform the UPDATE (or INSERT).
What I'm doing now is 'search before update'.
Before updating a doc, I search ES for the specific ID (included in the Kafka message), then check whether the doc meets the conditions (for example, whether its update_time is older). Only then do I update the doc, and I set refresh to true so the index is updated instantly.
What I'm worried about:
It seems transactional.
1. If there is only one thread executing synchronously, is it possible that when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?
2. If I have several threads consuming Kafka messages, how do I check before updating? Can I use a script to solve this problem?
If there is only one thread executing synchronously, is it possible that when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?
That is a possibility, since indexes are refreshed once every second by default. Reducing this value is neither recommended nor guaranteed to give you the desired result, since Elasticsearch is not designed for this.
If I have several threads consuming Kafka messages, how do I check before updating? Can I use a script to solve this problem?
You can use a script if the number of fields being updated is very limited. Personally, I have found scripts to be best suited for single-field updates, and even then only for corner cases; they should not be used as a general practice. Any more than that and you run into the same risks as with stored procedures in the RDBMS world: it makes data management volatile overall and the system harder to maintain and extend in the long run.
Your use case is best suited to the optimistic locking support available from Elasticsearch out of the box. Take a look at Elasticsearch versioning support for full details.
You can very well use the built-in doc version if concurrency is the only problem you need to solve. If, however, you need more than concurrency (out-of-order message delivery and the corresponding ES updates), then you should use an application/domain-specific field, since the built-in version won't work as-is.
You can use any app-specific (numeric) field as a version field and use it for optimistic locking during document updates. If you take this approach, pay special attention to all insert, update, and delete operations for that index. Quoting as-is from the versioning documentation: when using external versioning, make sure you always add the current version (and version_type) to any index, update or delete calls. If you forget, Elasticsearch will use its internal system to process that request, which will cause the version to be incremented erroneously.
I recommend you evaluate the built-in version first and use it if it fulfills your needs; it will make the overall design much simpler. Consider the app-specific version as the second option if the built-in version does not meet your requirements.
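As a rough sketch of the external-versioning approach (the index, type, ID, and field names below are made up, and update_time is assumed to be a numeric epoch value), you would re-index the whole document and pass your application timestamp as the version:

    PUT /orders/order/42?version=1526458000&version_type=external
    {
      "status": "SHIPPED",
      "update_time": 1526458000
    }

With version_type=external, Elasticsearch accepts the write only if the supplied version is strictly greater than the stored one, so a late, out-of-order message fails with a version conflict instead of silently overwriting newer data - no prior search (and no refresh) is needed.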
If there is only one thread executing synchronously, is it possible that when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?
Ad 1. It is possible to save data in Elasticsearch and, shortly afterwards, receive a stale result (before the index has been refreshed).
If I have several threads consuming Kafka messages, how do I check before updating? Can I use a script to solve this problem?
Ad 2. If you process Kafka messages in several threads, it is best to use business data (e.g. some business ID) as the partition key in Kafka to ensure that data for the same entity is processed in order. Remember to use Kafka itself to consume messages in many threads; don't consume with a single consumer and fan out to multiple threads later (a producer sketch follows below).
It seems it would be best to ensure data is processed in order and then drop the check in Elasticsearch, since it is not guaranteed to give valid results.
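A minimal sketch of that keying idea (the topic name, orderId, and JSON payload are hypothetical): records that share a key always land in the same partition, so updates for one entity are consumed in order.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedUpdateProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String orderId = "order-42";                    // hypothetical business id used as the key
                String updateJson = "{\"status\":\"SHIPPED\"}"; // hypothetical document update
                // Same key -> same partition -> per-entity ordering is preserved.
                producer.send(new ProducerRecord<>("order-updates", orderId, updateJson));
            }
        }
    }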

Does the Kafka streams aggregation have any ordering guarantee?

My Kafka topic contains statuses keyed by deviceId. I would like to use KStreamBuilder.stream().groupByKey().aggregate(...) to only keep the latest value of a status in a TimeWindow. I guess that, as long as the topic is partitioned by key, the aggregation function can always return the latest values in this fashion:
(key, value, older_value) -> value
Is this a guarantee I can expect from Kafka Streams? Should I roll my own processing method that checks the timestamp?
Kafka Streams guarantees ordering by offsets, but not by timestamps. Thus, by default the "last update wins" policy is based on offsets, not on timestamps. Late-arriving records ("late" being defined by timestamps) are out of order with respect to timestamps, and they will not be reordered; the original offset order is kept.
If you want your window to contain the latest value based on timestamps, you will need to use the Processor API (PAPI) to make this work.
Within Kafka Streams' DSL, you cannot access the record timestamp that is required to get the correct result. An easy way is to put a .transform() before the .groupBy() and add the timestamp to the record (i.e., to its value) itself. Then you can use the timestamp within your Aggregator (btw: a .reduce(), which is simpler to use, might also work instead of .aggregate()). Finally, you need a .mapValues() after the .aggregate() to remove the timestamp from the value again.
Using this mix-and-match approach of DSL and PAPI should simplify your code, as you can use the DSL's windowing support and KTable, and do not need to do low-level time-window and state management.
Of course, you could also do all of this in a single low-level stateful processor, but I would not recommend it.
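A rough sketch against the pre-1.0 KStreamBuilder API used in the question (the topic name "device-statuses", the one-minute window, the store name, and the plain String statuses are all assumptions; .transformValues() is used here instead of .transform() since the key is unchanged and no repartitioning is needed):

    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KStreamBuilder;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.kstream.ValueTransformer;
    import org.apache.kafka.streams.kstream.Windowed;
    import org.apache.kafka.streams.processor.ProcessorContext;

    public class LatestStatusTopology {

        public static KStreamBuilder build() {
            KStreamBuilder builder = new KStreamBuilder();
            // assumes String key/value serdes are configured as the defaults in StreamsConfig
            KStream<String, String> statuses = builder.stream("device-statuses");

            // Step 1 (PAPI): prepend the record timestamp to the value as "<ts>|<status>",
            // because the DSL aggregator cannot see the timestamp.
            KStream<String, String> stamped = statuses.transformValues(
                () -> new ValueTransformer<String, String>() {
                    private ProcessorContext context;

                    @Override
                    public void init(ProcessorContext context) { this.context = context; }

                    @Override
                    public String transform(String value) {
                        return context.timestamp() + "|" + value;
                    }

                    @Override
                    public String punctuate(long timestamp) { return null; }

                    @Override
                    public void close() { }
                });

            // Step 2 (DSL): per key and one-minute window, keep the value whose timestamp prefix is larger.
            KTable<Windowed<String>, String> latestStamped = stamped
                .groupByKey()
                .reduce(
                    (a, b) -> Long.parseLong(a.substring(0, a.indexOf('|')))
                              >= Long.parseLong(b.substring(0, b.indexOf('|'))) ? a : b,
                    TimeWindows.of(60_000L),
                    "latest-status-store");

            // Step 3 (DSL): strip the timestamp prefix from the value again.
            KTable<Windowed<String>, String> latest =
                latestStamped.mapValues(v -> v.substring(v.indexOf('|') + 1));

            // "latest" now holds, per deviceId and window, the status with the greatest timestamp.
            return builder;
        }
    }

Encoding the timestamp as a "<ts>|<status>" prefix is just a shortcut to keep the default String serdes working; a small wrapper type with its own serde would be the cleaner choice.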

InfluxDB: creating a new measurement

New to InfluxDB, but liking it a lot.
I've configured it to gather metrics from SNMP-polled devices - primarily network nodes.
I can happily graph the polled statistics using derived values, but what I want to know is:
Is it possible to create a new measurement in InfluxDB from data already stored?
The use case is that we poll network traffic and graph it by taking the derived difference between the current and last reading (in Grafana).
What I want to do is create a measurement that does that inside InfluxDB and stores the result. This is primarily so I can set up monitoring of the new derived value using a simple query and alert if it drops below x.
I have measurements snmp_rx / snmp_tx, with host and port name, containing the polled ifHCInOctets and ifHCOutOctets values.
So can I set up a process that continuously creates a new measurement for each, showing the difference between the current and last readings?
Thanks
The InfluxDB feature you are looking for is called continuous queries:
A CQ is an InfluxQL query that the system runs automatically and periodically within a database. InfluxDB stores the results of the CQ in a specified measurement.
It will allow you to automatically create and fill new octet-rate measurements from the raw ifHCInOctets/ifHCOutOctets counters you already have, using the derivative() function in the SELECT clause and a GROUP BY time() interval of your choosing. You can also do some scaling in the SELECT expression (like bytes to bits, etc.).
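A sketch of such a CQ, assuming a database named "snmp" and the measurement/field names from the question (the target measurement name, the 1m interval, and the bytes-to-bits scaling are illustrative):

    CREATE CONTINUOUS QUERY "cq_rx_bps" ON "snmp"
    BEGIN
      SELECT non_negative_derivative(mean("ifHCInOctets"), 1s) * 8 AS "rx_bps"
      INTO "snmp_rx_rate"
      FROM "snmp_rx"
      GROUP BY time(1m), *
    END

GROUP BY * alongside time(1m) carries the host and port tags into the new measurement, so the alert can be a simple query against snmp_rx_rate.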

InfluxDB reusing old value?

I have been switching from StatsD + Graphite + Grafana to using InfluxDB instead of Graphite. However, InfluxDB somehow behaves a bit differently from Graphite when it comes to missing values.
If a time series does not produce new points for a period of time, the plot in Grafana will continue to show the last value written.
This happens even when specifying fill(0) or fill(null) in the query. When using the data interface of InfluxDB, it also seems to fill in the previous values.
Since I have some alerting that will be triggered by missing values, having the old values reused disables my alerts.
Any idea on how to fix this?
If you want to show a continuous graph, there is a hack:
apply mean() and GROUP BY time().
For example, something like this:
SELECT mean("fieldName") FROM "measurement" WHERE time > now() - 1h GROUP BY time(10s) fill(0)
