update a get request to elastic search in nifi - apache-nifi

I'm trying to hit a GET request on my Elasticsearch index using the JsonQueryElasticSearch processor in NiFi.
For each flowfile I have some incoming attributes; based on those I need to generate a different GET request and store the response somewhere.
The list of processors I'm using is as below:
GetFile (to read the JSON file)
EvaluateJsonPath (to extract the attributes I want to use with every GET request, PROC_INST_ID_ in this case)
JsonQueryElasticSearch (to hit the request with the body below)
PutFile (to store the response)
Request body
{
  "query": {
    "nested": {
      "path": "los",
      "query": {
        "bool": {
          "must": [
            { "match": { "los.${proc_ins_id}": "784525" }},
            { "match": { "los._source.cibilPermission.VALUE_": "1" }}
          ]
        }
      }
    }
  }
}
I can't see the request being generated and I'm not getting any response; instead I'm only getting the value of proc_ins_id as the response in PutFile. Can someone suggest an appropriate way to do this?
Attaching relevant screenshots as well for reference.

I'm assuming you are providing the request body in the Query property of JsonQueryElasticSearch. In that case, you should set the Destination property to flowfile-attribute in EvaluateJsonPath, because if it is set to flowfile-content and the Query property is configured with an actual query, JsonQueryElasticSearch won't even read the content of the flowfile.
Also connect the hits and original relationships to two different processors; if you connect them to the same processor, the original flowfile (the one updated by EvaluateJsonPath) ends up in the directory configured in PutFile. In general, people auto-terminate the original relationship unless there is a need for it. You may also need to configure the aggregations relationship, because aggregation results are sent to that relationship.
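For reference, a minimal EvaluateJsonPath configuration along these lines (the JsonPath expression is an assumption based on the attribute name you mentioned):
Destination: flowfile-attribute
Return Type: auto-detect
proc_ins_id (dynamic property): $.PROC_INST_ID_
With that in place, the Query property of JsonQueryElasticSearch can reference ${proc_ins_id} through Expression Language, exactly as in your request body.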

Related

Is it possible to set several routing values using Elasticsearch NEST?

I need to query data from several shards. The Elasticsearch REST API makes it possible to send a request with several routing keys:
//https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html#_searching_with_custom_routing
GET my-index-000001/_search?routing=user1,user2
{
  "query": {
    "match": {
      "title": "document"
    }
  }
}
Is it possible to do the same with the NEST client?
Yes, you can pass a comma-separated string (e.g. "user1,user2") to the Routing() method of a search request.

Best practices for writing a PUT endpoint for a REST API

I am building a basic CRUD service with some business logic under the hood, and I'm about to start working on the PUT (update) endpoint. I have already fully written+tested GET (read) and POST (create) for my data object. The data store for my documents is an ElasticSearch instance on AWS.
I have some decisions to make about how I want to architect the PUT, namely, how I want to determine a valid request. My goal is to make it so that POST is only for the creation of new assets, and PUT will only update existing documents. (At the moment, I am POSTing to Elastic with /_doc/; however, the intent is to move to /_create/ as part of this work.)
What I'm a little hung up on is the "right" way to check that a document exists before making the API call to Elastic to update it.
When a user submits a document to PUT, should I first GET from Elastic with the document ID to make sure the document already exists? Or should I simply try to "update" the resource and, if it doesn't exist, have one created?
Obviously there are trade-offs to each strategy. With the latter, PUTting a document that doesn't exist almost completely negates the need for a POST at all, so I'd be more inclined to go with the former - despite the additional REST call - to maintain the integrity of the basic REST definition.
Thoughts?
The consideration of whether to update a doc (with versioning) or create a new one with some shared ID relating all previous versions depends on your use case -- either of them is 'correct', but there's too little information to advise on that right now.
With regards to the document-exists strategies -- there are essentially 2 types of IDs in ES -- what I call:
internal ids (_id)
external ids (doc_values-provided ids)
Create an index & a doc:
PUT myindex
PUT myindex/_doc/internal_id_1
{
  "external_id": "1"
}
Internal ID check
GET myindex/_doc/internal_id_1
or
GET myindex/_count
{
  "query": {
    "ids": {
      "values": [
        "internal_id_1"
      ]
    }
  }
}
or
GET myindex/_count
{
  "query": {
    "term": {
      "_id": {
        "value": "internal_id_1"
      }
    }
  }
}
External ID check
GET myindex/_count
{
  "query": {
    "term": {
      "external_id": {
        "value": "1"
      }
    }
  }
}
and many others (terms, match (for partial matches etc), ...)
Note that I've used the _count endpoint instead of _search -- it's slightly faster.
If you intend to check the _version of a given doc before you proceed to update it, replace _count with _search?version=true and the _version attribute will become available in each hit.
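For instance, a quick sketch reusing the ids query from above (the index and ID are the ones from the earlier example):
GET myindex/_search?version=true
{
  "query": {
    "ids": {
      "values": [
        "internal_id_1"
      ]
    }
  }
}
The matching hit then comes back with its _version: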
{
  "_index": "myindex",
  "_type": "_doc",
  "_id": "internal_id_1",
  "_version": 2,    <---
  "_score": 1.0,
  "_source": {
    "external_id": "1"
  }
}

How to extract and visualize values from a log entry in OpenShift EFK stack

I have an OKD cluster set up with an EFK stack for logging, as described here. I have never worked with any of the components before.
One deployment logs requests that contain a specific value that I'm interested in. I would like to extract just this value and visualize it with an area map in Kibana that shows the amount of requests and where they come from.
The content of the message field basically looks like this:
[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}
This plz is a German zip code, which I would like to visualize as described.
My problem here is that I have no idea how to extract this value.
A nice first success would be if I could find it with a regexp, but Kibana doesn't seem to work the way I think it does. Following its docs, I expect /\"plz\":\"[0-9]{5}\"/ to deliver the result, but I get 0 hits (the time interval is set correctly). And even if this regexp matched, I would only find the log entries that contain it, not the specific value itself. How do I go on from here?
I guess I also need an external geocoding service, but at which point would I include it? Or does Kibana itself know how to map zip codes to geometries?
A beginner-friendly step-by-step guide would be perfect, but I could settle for some hints that guide me there.
It would be possible to parse the message field as the document gets indexed into ES, using an ingest pipeline with a grok processor.
First, create the ingest pipeline like this:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    }
  ]
}
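If you want to verify the grok pattern before touching any real data, the ingest simulate API is handy (a quick sketch using your sample message):
POST _ingest/pipeline/parse-plz/_simulate
{
  "docs": [
    {
      "_source": {
        "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
      }
    }
  ]
}
The response shows the document as it would look after the pipeline has run, including the extracted plz field.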
Then, when you index your data, you simply reference that pipeline:
PUT plz/_doc/1?pipeline=parse-plz
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
}
And you will end up with a document like the one below, which now has a field called plz with the 12345 value in it:
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
"plz": "12345"
}
When indexing your document from Fluentd, you can specify a pipeline to be used in the configuration. If you can't or don't want to modify your Fluentd configuration, you can also define a default pipeline for your index that will kick in every time a new document is indexed. Simply run this on your index and you won't need to specify ?pipeline=parse-plz when indexing documents:
PUT index/_settings
{
"index.default_pipeline": "parse-plz"
}
If you have several indexes, a better approach might be to define an index template instead, so that whenever a new index called project.foo-something is created, the settings are going to be applied:
PUT _template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "settings": {
    "index.default_pipeline": "parse-plz"
  }
}
Now, in order to map that PLZ on a map, you'll first need to find a data set that provides you with geolocations for each PLZ.
You can then add a second processor in your pipeline in order to do the PLZ/ZIP to lat,lon mapping:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx.location = params[ctx.plz];",
        "params": {
          "12345": {"lat": 42.36, "lon": 7.33}
        }
      }
    }
  ]
}
Ultimately, your document will look like this and you'll be able to leverage the location field in a Kibana visualization:
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
  "plz": "12345",
  "location": {
    "lat": 42.36,
    "lon": 7.33
  }
}
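One detail worth adding (an assumption about your mappings): for Kibana's map visualizations the location field generally has to be mapped as geo_point, which dynamic mapping won't do on its own. Adding that mapping to the index template from above would look roughly like this; the exact syntax depends on your Elasticsearch version (on 6.x and earlier the properties block has to be nested under the document type):
PUT _template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "settings": {
    "index.default_pipeline": "parse-plz"
  },
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}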
So to sum it all up, it all boils down to only two things:
Create an ingest pipeline to parse documents as they get indexed
Create an index template for all project* indexes whose settings include the pipeline created in step 1

Elastic search: Delete by query isn't working

Introduction
I'm using Elastic Search (v5.x) and trying to delete documents, by query.
My index is called "data". The documents are stored in a hierarchical structure, and document URLs are built with this pattern:
https://server.ip/data/{userid}/{document-id}
So, let's say user-id '1' has two documents stored ('1' and '2'). Their direct URLs will be:
https://server.ip/data/1/1
https://server.ip/data/1/2
Target
Now, what I'm trying to do is to delete the user from the system (the user and his stored documents).
The only way that worked for me is to send an HTTP DELETE request for each document URL, like this:
DELETE https://server.ip/data/1/1
DELETE https://server.ip/data/1/2
This works, but with this solution I have to call delete multiple times. I want to delete all the documents in one call, so this solution is rejected.
My first try was to send an HTTP DELETE request to
https://server.ip/data/1
Unfortunately, it's not working (error code 400).
My second try was to use the _delete_by_query API. Each document that I store contains a UserId field. So, I tried to make a delete query that removes all the documents in the 'data' index whose UserId field has the value 1 ('UserId' == 1):
POST https://server.ip/data/_delete_by_query
{
  "query": {
    "match": {
      "UserId": "1"
    }
  }
}
This also did not work. The response was HTTP error code 400 with this body:
{
  "error": {
    "root_cause": [
      {
        "type": "invalid_type_name_exception",
        "reason": "Document mapping type name can't start with '_'"
      }
    ],
    "type": "invalid_type_name_exception",
    "reason": "Document mapping type name can't start with '_'"
  },
  "status": 400
}
Do you know how to solve these problems? Or maybe you have an alternative solution?
Thank you!
I assume you've got your document_type defined in your Logstash conf, something like this within your output > elasticsearch block:
output {
  elasticsearch {
    index => "1"
    document_type => "1type"
    hosts => "localhost"
  }
  stdout {
    codec => rubydebug
  }
}
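If you're not sure what the type is actually called, you can confirm it by looking at the index mapping first (a quick sanity check, using the same base URL as your other requests):
GET https://server.ip/data/_mapping
The type names show up as the keys under the mappings section of the response.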
Hence you could simply delete all the documents which have the same type:
curl -XDELETE https://server.ip/data/1/1type
OR try something like this if you're willing to use delete by query (note that the condition goes in the request body, not in the URL):
POST https://server.ip/data/_delete_by_query
{
  "query": {
    "match": {
      "UserId": "1"
    }
  }
}
This could be an absolute gem of a source. Hope it helps!

Why am I not getting expected results when searching in ElasticSearch using chrome plugin Sense?

So I've set up the following data set so I can test searching on a field storing multiple values:
post /test/participant
{
"Synonyms" : [ "foo" ]
}
post /test/participant
{
"Synonyms" : [ "bar" ]
}
post /test/participant
{
"Synonyms" : [ "foo", "bar" ]
}
I've tried to get some data back by trying something like:
get /test/participant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "term": { "Synonyms": "foo" }
      }
    }
  }
}
and I was expecting to get back the first and third records (see the order above). However, I keep getting all the records back. I've tried no end of alterations to the query to try to get something sensible (there's not enough space to add them here), and all I keep getting is every record in the index. Does anyone have an idea how I would query to get back the records with "foo" as a value (1st and 3rd)? And is there some subtle point I've been missing here? I'm aware that ElasticSearch does not store the values as an array but as an unordered collection.
I think you are running these queries in Sense, right?
The commands you need are these:
POST /test/participant
{"Synonyms":["foo"]}
POST /test/participant
{"Synonyms":["bar"]}
POST /test/participant
{"Synonyms":["foo","bar"]}
GET /test/participant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "Synonyms": "foo"
        }
      }
    }
  }
}
The explanation is related to the GET vs. POST HTTP methods.
Behind the scenes, Sense actually converts a GET request to an HTTP POST (given that many browsers do not support HTTP GET requests with a request body). This means that, even if you write GET, the actual HTTP request is a POST.
Because Sense's autocomplete forces upper-case letters for request methods, it also expects upper-case letters when deciding whether a request that has a body is a GET (and not a get). If it is, the request is transformed into a POST. If it decides the method is not a GET, it sends the request as is, meaning with a get method and a body. Since the body is then ignored, what reaches Elasticsearch is a plain get /test/participant/_search, which is basically a match_all and, of course, returns all documents :-).
