Connecting NiFi to Elasticsearch

I'm trying to solve one task and will appreciate any help - links to documentation, or links to forums, or other FAQs besides https://cwiki.apache.org/confluence/display/NIFI/FAQs, or any meaningful answer in this post =) .
So, I have the following task:
The first part of my system collects data every 5-15 minutes from different DB sources. Then I remove duplicates and junk, combine the data from the different sources according to some logic, and redirect it to the second part of the system as several streams.
As far as I know, "NiFi" can do this task in the best way =).
Currently I can successfully get information from InfluxDB with the GetHTTP processor. However, I can't configure the same kind of processor to get information from Elasticsearch with all the necessary options. I'd like to receive data every 5-15 minutes (depending on the scheduler period) for the time period from "now minus 5-15 minutes" to "now", with several additional filters. If I understand it right, this can be achieved either by subscribing to "_index" or by sending regular requests to the DB at the desired interval.
I know that NiFi has several specific processors designed for Elasticsearch (FetchElasticsearch5, FetchElasticsearchHttp, QueryElasticsearchHttp, ScrollElasticsearchHttp) as well as the GetHTTP and PostHTTP processors. Unfortunately, I lack information, or even better examples, on how to configure their "Properties" for my purposes =(.
What's the difference between FetchElasticsearchHttp and QueryElasticsearchHttp? Which one fits my task better? What's the difference between GetHTTP and QueryElasticsearchHttp besides several specific fields? Will GetHTTP perform the same way if I tune it as I need?
Any advice?
I will be grateful for any help.

The ElasticsearchHttp processors try to make it easier to interact with ES by generating the appropriate REST API call based on the properties you set. If you know the full URL you need, you could use GetHTTP or InvokeHTTP. However, the ElasticsearchHttp processors let you enter just the parameters you care about; they then generate the URL and return the results.
FetchElasticsearch (and its variants) is used to get a particular document when you know the identifier. This is sometimes used after a search/query, to return documents one at a time after you know which ones you want.
QueryElasticsearchHttp is for when you want to do a Lucene-style query of the documents, when you don't necessarily know which documents you want. It will only return up to the value of index.max_result_window for that index. To get more records, you can use ScrollElasticsearchHttp afterwards. NOTE: QueryElasticsearchHttp expects a query that will work as the "q" parameter of the URL. This "mini-language" does not support all fields/operators (see here for more details).
For your use case, you likely need InvokeHTTP in order to issue the kind of query you describe. This article describes how to issue a query for the last 15 minutes. Once your results are returned, you might need some combination of EvaluateJsonPath and/or SplitJson to work with the individual documents. See the Elasticsearch REST API documentation (and NiFi processor documentation) for more details.
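
As a rough illustration of the "last 15 minutes" query mentioned above, here is a minimal Python sketch that builds a search body which could be sent to an index's _search endpoint (for example as the request body of InvokeHTTP). The field name "@timestamp", the "status" filter, and the helper name are assumptions for illustration only; adjust them to your own mapping.

```python
import json


def build_recent_window_query(minutes, time_field="@timestamp", filters=None):
    """Build a search body for documents from the last `minutes` minutes.

    `time_field` and the keys of `filters` are hypothetical field names;
    replace them with fields that exist in your own index.
    """
    # Elasticsearch date math: "now-15m" means 15 minutes before "now".
    clauses = [{"range": {time_field: {"gte": f"now-{minutes}m", "lte": "now"}}}]
    # Each additional filter becomes an exact-match term clause.
    for field, value in (filters or {}).items():
        clauses.append({"term": {field: value}})
    return {"query": {"bool": {"filter": clauses}}}


body = build_recent_window_query(15, filters={"status": "error"})
request_payload = json.dumps(body)  # JSON text to use as the request body
```

If the processor is scheduled to run every 15 minutes, each run covers roughly the window since the previous run. Documents that arrive late or right at the boundary may be missed or fetched twice, so some de-duplication downstream may still be needed.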

Related

nifi-api: List all processors with their configuration

I want to list all my ListenHTTP processor URLs so I can select and kick off different flows.
Is it possible with a NiFi API query to list all processors with their configuration (in my case I'm looking to get 'Base Path' and 'Listening Port')?
Looking for a query that will return this info only (not the full processor details).
I can get an individual processor by name.
https://<IP-ADDRESS>:9443/nifi-api/flow/search-results?q=MyProcessor
Then parse out the processor's id from this result.
And with id get the processor's full details.
https://<IP-ADDRESS>:9443/nifi-api/processors/<PROCESSOR-ID>
But then I would have to parse out the config properties (and would have to repeat for each processor).
This seems a roundabout way of solving the problem.
Any help would be much appreciated.
Thanks
**** EDIT:
Best solution I can see at the moment is still a 2 step approach.
Get everything for ListenHTTP
https://<IP-ADDRESS>:9443/nifi-api/flow/search-results?q=ListenHTTP
This will return multiple JSON arrays, of which we want 'processorResults'.
Parse this (in Java) to get processor name and id.
Then (as above) get processor by 'id' and parse out config.
https://<IP-ADDRESS>:9443/nifi-api/processors/<PROCESSOR-ID>
You can use Python and NiPyAPI to recurse through the flow and get all the processors, then filter for the ListenHTTP processors. You can also use NiPyAPI to kick off the desired flows. It is a very handy tool.
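
Whichever way the processors are fetched, the filtering step could look roughly like the sketch below. It assumes each processor has already been converted to a plain dict shaped like the JSON the REST API returns for a processor (an "id" plus a "component" holding "name", "type" and "config" with "properties"). The function name and the sample data are made up for illustration.

```python
def listen_http_endpoints(processor_entities):
    """Return id, name, base path and port for every ListenHTTP processor."""
    endpoints = []
    for entity in processor_entities:
        component = entity.get("component", {})
        # The type is a fully qualified class name ending in ".ListenHTTP".
        if not component.get("type", "").endswith(".ListenHTTP"):
            continue
        props = component.get("config", {}).get("properties", {})
        endpoints.append({
            "id": entity.get("id"),
            "name": component.get("name"),
            "base_path": props.get("Base Path"),
            "port": props.get("Listening Port"),
        })
    return endpoints


# Hypothetical sample data, shaped like processor entities from the REST API.
sample = [
    {"id": "a1", "component": {
        "name": "Orders listener",
        "type": "org.apache.nifi.processors.standard.ListenHTTP",
        "config": {"properties": {"Base Path": "orders", "Listening Port": "8081"}}}},
    {"id": "b2", "component": {
        "name": "Log it",
        "type": "org.apache.nifi.processors.standard.LogAttribute",
        "config": {"properties": {}}}},
]
endpoints = listen_http_endpoints(sample)
```

This keeps only the two properties asked about and drops the rest of the processor details.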

Is there a way to combine a query and a command in CQRS?

I have a project built using CQRS, but I can't figure out how to implement one use case.
The user needs to be able to make a Query which will return a set of data for them to view. However, I also need to save the data they got at the same time.
Is there a way to do this within a Query without violating CQRS' principles? Or would the Query and Command need to be two separate API calls one after another?
In CQRS it is your client that can issue both commands and queries. This client is not necessarily a piece of UI.
It can be an API endpoint handler, which would
receive a query
forward it to the query endpoint
wait for the answer
send an answer to the caller
send a command to store the answer
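
The steps above can be sketched in Python as follows. This is only an illustration under assumptions: the class name, the "RecordQueryResult" command type and the two injected callables are hypothetical, and the question does not say which language or framework the project uses. A plain function has to return last, so the sketch sends the command just before answering; a real handler could respond first and dispatch the command asynchronously.

```python
class QueryThenRecordHandler:
    """API endpoint handler acting as a client of both the query and command sides."""

    def __init__(self, run_query, send_command):
        # Both collaborators are injected, so the query side itself never writes.
        self._run_query = run_query
        self._send_command = send_command

    def handle(self, query):
        # Forward the query to the query endpoint and wait for the answer.
        result = self._run_query(query)
        # Send a separate command that stores what the caller was shown.
        self._send_command({
            "type": "RecordQueryResult",
            "query": query,
            "result": result,
        })
        # Send the answer back to the caller.
        return result
```

The query handler stays side-effect free; the state change happens only through the command, which keeps the two sides separate even though the caller makes a single API call.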
Is there a way to do this within a Query without violating CQRS' principles?
It depends.
If "save the data" means "make some change to the domain model"... well, that would be pretty weird.
Asking a question should not change the answer. -- Bertrand Meyer
On the other hand, logging/telemetry are pretty normal ways to track the activity of an application, so that should be fine.
There are some realities of a distributed system on an unreliable network that you need to be aware of: what should the behavior be if the telemetry system is not available? What are the consequences of recording queries that don't actually reach the client (because the network is unreliable)?
As @VoiceOfUnreason stated, it may be somewhat strange to effect domain changes when querying data.
However, it may be that you could swop that around.
For instance, perhaps one could query a forecast of sorts. We would want to store that forecast. It then seems as though the query results in us having to save the result. This appears to break CQS at some level since each query would result in a change of state.
If we swop that around and first request a forecast via the domain handling and then that produces a result, or even a pointer to the result, then the query would be something you could perform on the data multiple times without "breaking" CQS.

How to conditionally process FlowFiles by a MongoDB query result?

I need to process a list of files based on the result of a MongoDB query, but I can't find any processor that would let me do that. Basically, I have to take each file and either process it or discard it completely, based on the result of a query that involves that file's attributes.
The only MongoDB-related processor that I see in NiFi 1.5.0 is GetMongo, which apparently can't receive incoming connections and can only emit FlowFiles based on its configured parameters.
Am I looking in the wrong place?
NIFI-4827 is an Improvement Jira that aims to allow GetMongo to accept incoming flow files, the content would contain the query, and the properties will accept Expression Language. The code is still under review, but the intent is to make it available in the upcoming NiFi 1.6.0 release.
As a possible workaround in the meantime, if there is a REST API, you could use InvokeHTTP to make the call(s) manually and parse the result(s). Also, if you have a JDBC driver for MongoDB (such as Unity), you might be able to use ExecuteSQL.

How can I see what is happening under the covers when an Elasticsearch query is executed?

For Elasticsearch 1.7.5 (or earlier), how can I see what steps Elasticsearch takes to handle my queries?
I attempted to turn debugging on by setting es.logger.level=DEBUG, but while that produced a fair amount of information at startup and shutdown, it doesn't produce anything when queries are executed. Looking at the source code, apparently most of the debug logging for searches is just for exceptional situations.
I am trying to understand query performance. We're seeing Elasticsearch do way more I/O than we expected, on a very simple term query on an unanalyzed field.
With ES 1.7.5 and earlier versions, you can use the ?explain=true URL parameter when sending your query and you'll get some more insights into how the score was computed.
Also starting with ES 2.2, there is a new Profile API which you can use to get more insights into timing information while the different query components are being executed. Simply add "profile": true to the search body payload and you're good to go.
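
To make the two options concrete, here is a small Python sketch that only builds the requests and does not send them. The base URL, index name and query values are placeholders, and both helper names are made up for illustration.

```python
from urllib.parse import urlencode


def explain_search_url(base_url, index, query_string):
    """URL for a query-string search with scoring explanations enabled."""
    params = urlencode({"q": query_string, "explain": "true"})
    return f"{base_url}/{index}/_search?{params}"


def with_profile(search_body):
    """Copy of a search body with the Profile API switched on (ES 2.2+)."""
    profiled = dict(search_body)
    profiled["profile"] = True
    return profiled


# Placeholder host, index and field values.
url = explain_search_url("http://localhost:9200", "logs", "user:kimchy")
body = with_profile({"query": {"term": {"user": "kimchy"}}})
```

Note that explain reports how each hit's score was computed, while profile reports timing per query component, so profile is usually the more relevant one when investigating performance.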

Logstash aggregation based on 'temporary id'

I'm not sure if this sort of aggregation is best done after being indexed by elasticsearch or if logstash is a good place to do it.
We are logging information about commands run against a server. Each set of metrics regarding a single command is logged as a single log event, and there are multiple 'metric sets' per command. Each metric is its own document type in ES (currently at least). So we will have multiple events across multiple documents regarding one command run against the server.
Each of these events will have a 'cmdno' field which is a temporary id given to the command we are logging about. Once the command has finished with all events logged, the 'cmdno' may be reused for other commands.
Is it possible to use logstash 'aggregate' plugin to link the events of a single command together using the 'cmdno'? (or any plugin)
All events that pertain to a single command will have the same timestamp + cmdno. I would like to add a UUID to the events as a permanent unique id for that command, so that a single query will give us all events for that single command.
Was thinking along the lines of:
if [cmdno] {
  aggregate {
    task_id => "%{cmdno}"
    code => "map['cmdid'] ||= <some uuid generator>; event['cmdid'] == map['cmdid'] ? event['@timestamp'] == map['<stored timestamp for previous event from the same command>'] : continue"
  }
}
I've just started learning the ELK stack, so I'm not entirely sure yet what programming constructs Logstash affords me.
I don't know if there is a better way to relate these events; this seemed the most suitable for our needs. If there are more ELK'y methods, please let me know. They do need to stay as separate documents of different types, though.
Any help much appreciated, let me know if I am missing anything.
Cheers,
Brett
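
Independent of Logstash, the intended logic (one permanent id per timestamp + cmdno pair) can be sketched in plain Python as below. This is not Logstash configuration and makes no claim about what the aggregate plugin supports; the function name and the "cmdid" field follow the pseudocode above and are otherwise assumptions.

```python
import uuid


def tag_events_with_command_id(events):
    """Attach the same 'cmdid' UUID to all events sharing timestamp and cmdno."""
    ids_by_key = {}
    tagged = []
    for event in events:
        # A reused cmdno with a different timestamp counts as a new command.
        key = (event["@timestamp"], event["cmdno"])
        if key not in ids_by_key:
            ids_by_key[key] = str(uuid.uuid4())
        # Copy the event so the input list is left unmodified.
        tagged.append({**event, "cmdid": ids_by_key[key]})
    return tagged
```

The mapping grows with every new command, so a long-running version would need to expire old keys once a command's events have all been seen.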
