Retain tag/field across events in logstash 1.5 - elasticsearch

I'm using logstash 1.5 to analyze logs.
I want to track two events which occur one after the other.
So I would like to set a flag/field/tag when the first event occurs and retain that value across events.
I looked at this link, but it looks like grep and drop are not supported in logstash 1.5.
Is there a way of achieving this?

The closest you can get with logstash is the elapsed{} filter. You could use that code as a basis for your own filter if it doesn't meet your needs. I also run some external (python) post-processing to do more than elapsed{} can (or should) do.
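For illustration, here is a minimal elapsed{} sketch, assuming the two events can be tagged earlier in the pipeline (e.g. by grok) and share a correlation field; the tag names and the task_id field below are placeholders, not anything from your logs:

filter {
  elapsed {
    start_tag       => "taskStarted"   # tag added to the first event
    end_tag         => "taskEnded"     # tag added to the second event
    unique_id_field => "task_id"       # field both events share
    timeout         => 600             # seconds to wait for the second event
  }
}

When a matching pair is seen, the filter adds an elapsed_time field and an "elapsed" tag to the second event, which effectively carries a marker across the two events.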

Related

Apache NiFi instance hangs on the "Computing FlowFile lineage..." window

My Apache NiFi instance just hangs on the "Computing FlowFile lineage..." window for a specific flow. Others work, but it won't show the lineage for this specific flow for any data files. The only error message in the log relates to an error in one of the processors, but I can't see how that would affect the lineage or stop the page from loading.
This was related to two things...
1) I was using the older (but default) provenance repository, which didn't perform well, resulting in the lag in the UI. So I needed to change it...
#nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
2) Fixing #1 exposed the second issue, which was that the EnforceOrder processor was generating hundreds of provenance events per file, because I was ordering on a timestamp, which had large gaps between the values. This is apparently not a proper use case for the EnforceOrder processor. So I'll have to remove it and find another way to do the ordering.

Re-queue a Logstash event

Is it possible to re-queue a Logstash event to be processed again in a bit of time?
In order to illustrate what I mean, I will explain my use case: I have a custom Logstash filter that extracts the application version from logs at the start of an application, and then appends the correct version to every log event. However, in the very beginning, race conditions can occur where an application version has not yet been written to a file, and yet the Logstash filter tries to read in the data anyway (since it is processing log lines concurrently). This results in an application version that is null. In case it matters, Logstash gets its input from Filebeat.
I would like to re-queue these events to be re-processed entirely a couple seconds (or milliseconds) from now, when the application version has been saved to the disk.
However this leads me to a broader question, which is, can you tell a Logstash event to be re-queued, or is there an alternative solution to this scenario?
Thanks for the help!
Process the data and append it to a new file; after that, use that file to further process the data:
Logstash pipeline 1: get the data, process it, and append it to a file.
Logstash pipeline 2: get the data from that second file and do whatever you want to do.
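If you go that route, a rough sketch of the hand-off could look like the following; the file path and the json_lines codec are assumptions, not anything mandated by Logstash:

# Pipeline 1: process the Filebeat input as usual, then append events to an intermediate file
output {
  file {
    path  => "/var/log/enriched/app-events.json"
    codec => json_lines
  }
}

# Pipeline 2: pick the events up again later, once the application version is on disk
input {
  file {
    path  => "/var/log/enriched/app-events.json"
    codec => json_lines
  }
}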

How to conditionally process FlowFile's by a MongoDB query result?

I need to process a list of files based on the result of a MongoDB query, but I can't find any processor that would let me do that. I basically have to take each file and either process it or discard it completely, based on the result of a query that involves that file's attributes.
The only MongoDB-related processor that I see in NiFi 1.5.0 is GetMongo, which apparently can't receive incoming connections, but can only emit FlowFiles based on the configured parameters.
Am I looking in the wrong place?
NIFI-4827 is an Improvement Jira that aims to allow GetMongo to accept incoming flow files; the content would contain the query, and the properties will accept Expression Language. The code is still under review, but the intent is to make it available in the upcoming NiFi 1.6.0 release.
As a possible workaround in the meantime, if there is a REST API you could use InvokeHttp to make the call(s) manually and parse the result(s). Also if you have a JDBC driver for MongoDB (such as Unity), you might be able to use ExecuteSQL.

Connecting NiFi to ElasticSearch

I'm trying to solve a task and would appreciate any help: links to documentation, links to forums, other FAQs besides https://cwiki.apache.org/confluence/display/NIFI/FAQs, or any meaningful answer in this post =).
So, I have the following task:
The initial part of my system collects data every 5-15 minutes from different DB sources. Then I remove duplicates, remove junk, combine data from the different sources according to my logic, and redirect it to the second part of the system as several streams.
As far as I know, NiFi can do this task in the best way =).
Currently I can successfully get information from InfluxDB with the GetHTTP processor. However, I can't configure the same kind of processor for getting information from Elasticsearch with all the necessary options. I'd like to receive data every 5-15 minutes for the time period from "now minus <5-15 minutes>" to "now" (depending on the scheduler period), with several additional filters. If I understand it right, this can be achieved either by a subscription to "_index" or by regular requests to the DB at the desired interval.
I know that NiFi has several processors designed specifically for Elasticsearch (FetchElasticsearch5, FetchElasticsearchHttp, QueryElasticsearchHttp, ScrollElasticsearchHttp) as well as the GetHTTP and PostHTTP processors. Unfortunately, however, I lack information, or better yet examples, on how to configure their Properties for my purposes =(.
What's the difference between FetchElasticsearchHttp and QueryElasticsearchHttp? Which one fits my task better? What's the difference between GetHTTP and QueryElasticsearchHttp besides several specific fields? Will GetHTTP perform the same way if I tune it as I need?
Any advice?
I will be grateful for any help.
The ElasticsearchHttp processors try to make it easier to interact with ES by generating the appropriate REST API call based on the properties you set. If you know the full URL you need, you could use GetHttp or InvokeHttp. However the ESHttp processors let you put in just the stuff you're looking for, and it will generate the URL and return the results.
FetchElasticsearch (and its variants) is used to get a particular document when you know the identifier. This is sometimes used after a search/query, to return documents one at a time after you know which ones you want.
QueryElasticsearchHttp is for when you want to do a Lucene-style query of the documents, when you don't necessarily know which documents you want. It will only return up to the value of index.max_result_window for that index. To get more records, you can use ScrollElasticsearchHttp afterwards. NOTE: QueryElasticsearchHttp expects a query that will work as the "q" parameter of the URL. This "mini-language" does not support all fields/operators (see here for more details).
For your use case, you likely need InvokeHttp in order to issue the kind of query you describe. This article describes how to issue a query for the last 15 minutes. Once your results are returned, you might need some combination of EvaluateJsonPath and/or SplitJson to work with the individual documents, see the Elasticsearch REST API documentation (and NiFi processor documentation) for more details.
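As a rough illustration, the search body that InvokeHttp would POST to <your index>/_search might look like the range query below; the @timestamp field name is an assumption and should match whatever timestamp field your mapping uses:

{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-15m",
        "lte": "now"
      }
    }
  }
}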

Logstash aggregation based on 'temporary id'

I'm not sure if this sort of aggregation is best done after being indexed by elasticsearch or if logstash is a good place to do it.
We are logging information about commands run against a server. Each set of metrics regarding a single command is logged as a single log event, and there are multiple 'metric sets' per command. Each metric is of its own document type in ES (currently, at least). So we will have multiple events across multiple documents regarding one command run against the server.
Each of these events will have a 'cmdno' field which is a temporary id given to the command we are logging about. Once the command has finished with all events logged, the 'cmdno' may be reused for other commands.
Is it possible to use logstash 'aggregate' plugin to link the events of a single command together using the 'cmdno'? (or any plugin)
All events that pertain to a single command will have the same timestamp + cmdno. I would like to add a UUID to the events as a permanent unique id for that command, so that a single query will give us all events for that single command.
Was thinking along the lines of:
if [cmdno] {
  aggregate {
    task_id => "%{cmdno}"
    code => "
      require 'securerandom'
      # a new timestamp for this cmdno means a new command run, so mint a new UUID
      map['cmdid'] = SecureRandom.uuid if map['ts'] != event['@timestamp'].to_s
      map['ts'] = event['@timestamp'].to_s
      event['cmdid'] = map['cmdid']
    "
  }
}
Just started learning the ELK stack, so I'm not entirely sure as to the programming constructs logstash affords me yet.
I don't know if there is a better way to relate these events; this seemed the most suitable for our needs. If there are more ELK'y methods please let me know, though they do need to stay as separate documents of different types.
Any help much appreciated, let me know if I am missing anything.
Cheers,
Brett
