How to merge split FlowFiles with the data from Elasticsearch? - elasticsearch

I have a problem with merging split FlowFiles. Let me explain the problem step by step.
This is my sequence of processors.
In Elasticsearch I have this index and mapping:
PUT /myindex
{
  "mappings": {
    "myentries": {
      "_all": {
        "enabled": false
      },
      "properties": {
        "yid": {"type": "keyword"},
        "days": {
          "properties": {
            "Type1": { "type": "date" },
            "Type2": { "type": "date" }
          }
        },
        "directions": {
          "properties": {
            "name": {"type": "keyword"},
            "recorder": { "type": "keyword" },
            "direction": { "type": "integer" }
          }
        }
      }
    }
  }
}
I get the directions from Elasticsearch using QueryElasticsearchHTTP and then split them with SplitJson in order to get 10 FlowFiles. Each FlowFile has this content: {"name": "X","recorder": "X", "direction": "X"}
After this, for each of the 10 FlowFiles I generate a filename attribute using UpdateAttribute and ${UUID()}. Then I enrich each FlowFile with some constant data from Elasticsearch. In fact, the data that I merge into each FlowFile is the same, so ideally I would like to run Get constants from Elastic only once instead of ten times.
But anyway, the key problem is different: the FlowFiles that come from Get constants from Elastic have different filename values, so they cannot be merged with the files that come from Set the attribute "filename". I also tried to use EvaluateJsonPath, but had the same problem. Any idea how to solve this?
UPDATE:
The Groovy code used in Merge inputs... I am not sure whether it works when batches of 10 and 10 files that should be merged arrive in the input queues:
import org.apache.nifi.flowfile.FlowFile
import org.apache.nifi.processor.FlowFileFilter
import org.apache.nifi.processor.FlowFileFilter.FlowFileFilterResult
import org.apache.nifi.processor.io.OutputStreamCallback
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder

//get the first flow file
def ff0 = session.get()
if(!ff0) return

def filename = ff0.getAttribute('filename')
//try to find files with the same attribute in the incoming queue
def ffList = session.get(new FlowFileFilter(){
    public FlowFileFilterResult filter(FlowFile ff) {
        if( filename == ff.getAttribute('filename') ) return FlowFileFilterResult.ACCEPT_AND_CONTINUE
        return FlowFileFilterResult.REJECT_AND_CONTINUE
    }
})
//require at least one additional file in the queue with the same attribute,
//otherwise put ff0 back and wait for its partner(s)
if( !ffList || ffList.size()<1 ){
    session.rollback(false)
    return
}
//put all files in one list to simplify later iterations
ffList.add(ff0)
//create an empty map (aka json object)
def json = [:]
//iterate through the files, parse each one and merge its fields
ffList.each{ff->
    session.read(ff).withStream{rawIn->
        def fjson = new JsonSlurper().parse(rawIn)
        json.putAll(fjson)
    }
}
//create a new flow file and write the merged json as its content
def ffOut = session.create()
ffOut = session.write(ffOut,{rawOut->
    rawOut.withWriter("UTF-8"){writer->
        new JsonBuilder(json).writeTo(writer)
    }
} as OutputStreamCallback )
//set mime-type
ffOut = session.putAttribute(ffOut, "mime.type", "application/json")
//drop the originals and transfer the merged file
session.remove(ffList)
session.transfer(ffOut, REL_SUCCESS)

Related

How to reindex into an existing index without erasing previous data

I'm using the reindex API to adapt data from an old format into a new format like so:
POST /_reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  },
  "script": {
    "source": """
      ArrayList convertField(def str) {
        // [complicated conversion]
        return reformatted_data;
      }
      ctx._source.specific_field = convertField(ctx._source.specific_field);
    """
  }
}
For the sake of a load test I would like to duplicate the data into the new index (it doesn't need to be exactly the same; some scripted alterations would be fine).
The problem is, every time I run the reindex, all data in the target index is deleted and replaced by the new batch. How do I keep the current data and add to it, instead of replacing it?
The easiest way is to set the _id field of the reindexed documents to null, using the script field. Elasticsearch then assigns a fresh auto-generated ID to each reindexed document instead of overwriting by _id. In your case:
POST /_reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  },
  "script": {
    "source": """
      ArrayList convertField(def str) {
        // [complicated conversion]
        return reformatted_data;
      }
      ctx._source.specific_field = convertField(ctx._source.specific_field);
      ctx._id = null
    """
  }
}
The reason is that there can only be one document with a given _id, so a reindexed document that keeps its old _id replaces the one already in the target index.
For my test, all I did was edit each _id:
ctx._id = ctx._id + "_1";
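If you prefer to drive this from code, the same request can be sent through the official Python client. Here is a minimal sketch, assuming elasticsearch-py 7.x, a cluster on localhost and the index names from the example above:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Same reindex body as above: nulling _id makes Elasticsearch assign a fresh
# auto-generated id to every copied document instead of overwriting by _id.
es.reindex(
    body={
        "source": {"index": "old_index"},
        "dest": {"index": "new_index"},
        "script": {"source": "ctx._id = null"}
    },
    wait_for_completion=True
)

Run repeatedly, this keeps adding new copies to new_index instead of replacing the existing ones, which is what the load test needs.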

ElasticsearchTransport: use transformer to change indexPrefix?

UPDATE: I can actually change the indexPrefix using the code below, but the actual _index, which is used to filter in Kibana, gets its name from the original indexPrefix. It seems changing the indexPrefix in the transformer method is too late, because the _index has already been created with the original prefix.
I'm using winston and winston-elasticsearch in a nodejs/express setup and want to use the same logger to log to different indices (different indexPrefix).
const logger = winston.createLogger({
  transports
});
transports is an array of different transports. One of them is an ElasticsearchTransport that takes in some parameters like indexPrefix and level, among others. The level can be changed based on the type of log by passing in a transformer method as a parameter.
new winstonElastic.ElasticsearchTransport({
  level: logLevel,
  indexPrefix: getIndex(),
  messageType: 'log',
  ensureMappingTemplate: true,
  mappingTemplate: indexTemplateMapping,
  transformer: (logData: any) => {
    const transformed: any = {};
    transformed['#timestamp'] = logData.timestamp ? logData.timestamp : new Date().toISOString();
    transformed.message = logData.message;
    transformed.level = parseWinstonLevel(logData.level);
    transformed.fields = _.extend({}, staticMeta, logData.meta);
    transformed.indexPrefix = getIndex(logData.indexPrefix);
    return transformed;
  },
});
The transformer method is called whenever the logger writes a new entry, and I can verify that it works by setting properties like message. It also overwrites the level to whatever the current log level is. For some reason it doesn't work on the indexPrefix property: even when it changes, nothing overwrites the initial indexPrefix. I even tried to remove the initial value, but then the logging fails, having never set the indexPrefix.
Does anyone know why? Does it have to do with the mappingTemplate listed below?
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": {
      "refresh_interval": "5s"
    }
  },
  "mappings": {
    "properties": {
      "#timestamp": { "type": "date" },
      "#version": { "type": "keyword" },
      "message": { "type": "text" },
      "severity": { "type": "keyword" },
      "fields": {
        "dynamic": true,
        "properties": {}
      }
    }
  }
}
OK, in case anyone is interested: I ended up making a loggerFactory instead. I create a logger seeded with the correct indexPrefix through the factory; that way I end up with one logger instance per indexPrefix I want...
For those who are having the same problem, I solved this in a different way:
1 - I created a variable inside ElasticsearchTransport scope;
2 - I changed the value of the variable inside the transformer method;
3 - I used the variable inside the indexPrefix method to define which prefix to use:
indexPrefix: () => variable ? 'test1' : 'test2'

How to create a HashMap with custom object as a key?

In Elasticsearch, I have an object that contains an array of objects. Each object in the array has type, id, updateTime and value fields.
My input parameter is an array that contains objects of the same type but different values and update times. I'd like to update the objects with the new value when they exist and create new ones when they don't.
I'd like to use a Painless script to update those but keep them distinct, as some of them may overlap. The issue is that I need to use both type and id to keep them unique. So far I've done it with a brute-force approach (a nested for loop comparing elements of both arrays), but I'm not too happy about that.
One idea is to take the array from the source, build a temporary HashMap for fast lookup, process the input and later store all objects back into the source.
Can I create a HashMap with a custom object (a class with type and id) as a key? If so, how do I do it? I can't add a class definition to the script.
Here's the mapping. All fields are 'disabled' as I use them only as intermediate state and query using other fields.
{
  "properties": {
    "arrayOfObjects": {
      "properties": {
        "typ": {
          "enabled": false
        },
        "id": {
          "enabled": false
        },
        "value": {
          "enabled": false
        },
        "updated": {
          "enabled": false
        }
      }
    }
  }
}
Example doc.
{
  "arrayOfObjects": [
    {
      "typ": "a",
      "id": "1",
      "updated": "2020-01-02T10:10:10Z",
      "value": "yes"
    },
    {
      "typ": "a",
      "id": "2",
      "updated": "2020-01-02T11:11:11Z",
      "value": "no"
    },
    {
      "typ": "b",
      "id": "1",
      "updated": "2020-01-02T11:11:11Z"
    }
  ]
}
And finally, part of the script in its current form. The script does some other things too, so I've stripped them out for brevity.
if (ctx._source.arrayOfObjects == null) {
  ctx._source.arrayOfObjects = new ArrayList();
}
for (obj in params.inputObjects) {
  def found = false;
  for (existingObj in ctx._source.arrayOfObjects) {
    if (obj.typ == existingObj.typ && obj.id == existingObj.id && isAfter(obj.updated, existingObj.updated)) {
      existingObj.updated = obj.updated;
      existingObj.value = obj.value;
      found = true;
      break;
    }
  }
  if (!found) {
    ctx._source.arrayOfObjects.add([
      "typ": obj.typ,
      "id": obj.id,
      "value": params.inputValue,
      "updated": obj.updated
    ]);
  }
}
There's technically nothing suboptimal about your approach.
A HashMap could potentially save some time, but since you're scripting, you're already bound to its innate inefficiencies... As for how to initialize and work with HashMaps, see the sketch below.
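A rough sketch of that idea, sent through the Python client. Since Painless can't declare custom classes to use as keys, a plain HashMap keyed by a composite "typ|id" string gives the same fast lookup; the index name and document id are placeholders, and the isAfter() timestamp check from the original script is left out for brevity.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Painless sketch: index the existing array into a HashMap keyed by "typ|id",
# then walk the input once instead of using a nested loop.
PAINLESS = """
if (ctx._source.arrayOfObjects == null) {
  ctx._source.arrayOfObjects = new ArrayList();
}
Map byKey = new HashMap();
for (existing in ctx._source.arrayOfObjects) {
  byKey.put(existing.typ + '|' + existing.id, existing);
}
for (obj in params.inputObjects) {
  def key = obj.typ + '|' + obj.id;
  if (byKey.containsKey(key)) {
    def existing = byKey.get(key);
    existing.updated = obj.updated;
    existing.value = obj.value;
  } else {
    ctx._source.arrayOfObjects.add(["typ": obj.typ, "id": obj.id, "value": params.inputValue, "updated": obj.updated]);
  }
}
"""

# Hypothetical index name and document id, only to make the call complete.
es.update(
    index="myindex",
    id="doc-1",
    body={
        "script": {
            "source": PAINLESS,
            "params": {
                "inputObjects": [
                    {"typ": "a", "id": "1", "updated": "2020-02-02T10:10:10Z"}
                ],
                "inputValue": "maybe"
            }
        }
    }
)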
Another approach would be to rethink your data structure: instead of arrays of objects, use keyed objects or similar. Arrays of objects aren't great for frequent updates.
Finally, a tip: you said that these fields are only used to store some intermediate state. If that weren't the case (or won't be in the future), I'd recommend looking into the nested field type so you can query the objects independently of the other objects in the array.

how to create a join relation using elasticsearch python client

I am looking for any examples that implement the parent-child relationship using the python interface.
I can define a mapping such as
es.indices.create(
    index="docpage",
    body={
        "mappings": {
            "properties": {
                "my_join_field": {
                    "type": "join",
                    "relations": {
                        "my_document": "my_page"
                    }
                }
            }
        }
    }
)
I am then indexing a document using
res = es.index(index="docpage", doc_type="_doc", id=1, body=jsonDict)
where jsonDict is a dict structure of document's text,
jsonDict['my_join_field']= 'my_document', and other relevant info.
Reference example.
I tried adding pageDict where the page is a string containing text on a page in a document, and
pageDict['content']=page
pageDict['my_join_field']={}
pageDict['my_join_field']['parent']="1"
pageDict['my_join_field']['name']="page"
res = es.index(index="docpage",doc_type="_doc",body=pageDict)
but I get a parser error:
RequestError(400, 'mapper_parsing_exception', 'failed to parse')
Any ideas?
This worked for me:
res = es.index(index="docpage", doc_type="_doc", body={
    "content": page,
    "my-join-field": {
        "name": "my_page",
        "parent": "1"
    }
})
The initial syntax can work if the parent id is also passed as the routing parameter:
res = es.index(index="docpage", doc_type="_doc", body=pageDict, routing=1)
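For completeness, here is a minimal sketch of both sides of the relation with the Python client (document bodies are placeholders; index and field names are taken from the question). The parent only carries the relation name, while the child needs the name/parent pair plus a routing value equal to the parent id so that both documents land on the same shard:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Parent: the join field is just the relation name.
es.index(index="docpage", id=1, body={
    "text": "full document text",          # placeholder content
    "my_join_field": "my_document"
})

# Child: name + parent, and routing must be the parent's id so the child is
# stored on the same shard as its parent.
es.index(index="docpage", id=2, routing=1, body={
    "content": "text of page 1",           # placeholder content
    "my_join_field": {"name": "my_page", "parent": "1"}
})

A has_child or has_parent query can then be used to retrieve pages of a given document, or documents having a given page.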

How to use mapping in elasticsearch?

After processing logs with Logstash, all my fields have the same type (string), so I want to use mappings in Elasticsearch to change some types, like ip, port, etc. I don't know how to do it; I'm a complete beginner with Elasticsearch.
Any help?
The first thing to do would be to install the Marvel plugin in Elasticsearch. It allows you to work with the Elasticsearch REST API very easily - to index documents, modify mappings, etc.
Go to the Elasticsearch folder and run:
bin/plugin -i elasticsearch/marvel/latest
Then go to http://localhost:9200/_plugin/marvel/sense/index.html to access Marvel Sense from which you can send commands. Marvel itself provides you with a dashboard about Elasticsearch indices, performance stats, etc.: http://localhost:9200/_plugin/marvel/
In Sense, you can run:
GET /_cat/indices
to learn what indices exist in your Elasticsearch instance.
Let's say there is an index called logstash.
You can check its mapping by running:
GET /logstash/_mapping
Elasticsearch will return a JSON document that describes the mapping of the index. It could be something like:
{
  "logstash": {
    "mappings": {
      "doc": {
        "properties": {
          "Foo": {
            "properties": {
              "x": {
                "type": "string"
              },
              "y": {
                "type": "string"
              }
            }
          }
        }
      }
    }
  }
}
...in this case doc is the document type (collection) in which you index documents. In Sense, you could index a document as follows:
PUT logstash/doc/1
{
  "Foo": {
    "x": "500",
    "y": "200"
  }
}
... that's a command to index the JSON object under the id 1.
Once a document field such as Foo.x has the type string, it cannot be changed to a number. You have to set the mapping first and then reindex the data.
First delete the index:
DELETE logstash
Then create the index and set the mapping as follows:
PUT logstash

PUT logstash/doc/_mapping
{
  "doc": {
    "properties": {
      "Foo": {
        "properties": {
          "x": {
            "type": "long"
          },
          "y": {
            "type": "long"
          }
        }
      }
    }
  }
}
Now, even if you index a doc with the properties as JSON strings, Elasticsearch will convert them to numbers:
PUT logstash/doc/1
{
  "Foo": {
    "x": "500",
    "y": "200"
  }
}
Search for the new doc:
GET logstash/_search
Notice that the returned document, in the _source field, looks exactly the way you sent it to Elasticsearch. That's on purpose: Elasticsearch always preserves the original doc this way. The properties are indexed as numbers, though. You can run a range query to confirm:
GET logstash/_search
{
  "query": {
    "range": {
      "Foo.x": {
        "gte": 500
      }
    }
  }
}
With respect to Logstash, you might want to set a mapping template for index names matching logstash-*, since Logstash creates new indices automatically: http://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-templates.html
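If you manage this from Python rather than Sense, the same template can be registered through the client. Here is a sketch assuming the 1.x-era template syntax the linked docs describe; the template name is arbitrary:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Applied automatically to every index whose name matches logstash-*, so the
# daily indices Logstash creates pick up the numeric mapping for Foo.
es.indices.put_template(
    name="logstash_numeric_foo",
    body={
        "template": "logstash-*",    # 1.x syntax; later versions use "index_patterns"
        "mappings": {
            "doc": {
                "properties": {
                    "Foo": {
                        "properties": {
                            "x": {"type": "long"},
                            "y": {"type": "long"}
                        }
                    }
                }
            }
        }
    }
)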
