Dedup elasticsearch results using multiple fields as unique key - elasticsearch

Similar questions to this have been asked (see Remove duplicate documents from a search in Elasticsearch), but I haven't found a way to dedup using multiple fields as the "unique key". Here's a simple example to illustrate what I'm looking for:
Say this is our raw data:
{ "name": "X", "event": "A", "time": 1 }
{ "name": "X", "event": "B", "time": 2 }
{ "name": "X", "event": "B", "time": 3 }
{ "name": "Y", "event": "A", "time": 4 }
{ "name": "Y", "event": "C", "time": 5 }
I would essentially like to get the distinct event counts based on name and event. I want to avoid double counting the event B which happened on the same name X twice, so the counts I'd be looking for are:
event: A, count: 2
event: B, count: 1
event: C, count: 1
Is there a way to set up an agg query as seen in the related question? Another option I've deliberated is to index the object with a special key field (i.e. "X_A", "X_B", etc.). I could then simply dedup on this field. I'm not sure which is a preferred approach, but I'd personally prefer not to index the data with extra metadata.

You can specify a script in a terms aggregation in order to build a key out of multiple fields:
POST /test/dedup/_search
{
  "aggs": {
    "dedup": {
      "terms": {
        "script": "[doc.name.value, doc.event.value].join('_')"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
This will basically provide the following results:
X_A: 1
X_B: 2
Y_A: 1
Y_C: 1
Note: The per-event counts you're after can then be derived by counting the distinct composite keys for each event: A appears in two keys (X_A, Y_A), while B and C each appear in one.
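If you're on a recent Elasticsearch version where Painless is the default scripting language, the script needs slightly different syntax. A minimal sketch, assuming name and event are text fields with keyword subfields:
POST /test/_search
{
  "size": 0,
  "aggs": {
    "dedup": {
      "terms": {
        "script": {
          "source": "doc['name.keyword'].value + '_' + doc['event.keyword'].value"
        }
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}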

Related

Elasticsearch to return documents based on 2 criteria where one is based on the other

I have documents in the following format:
{
  "id": number,
  "chefId": number,
  "name": String,
  "ingredients": List<String>,
  "isSpecial": boolean
}
Here is a list of 5 documents:
{
  "id": 1,
  "chefId": 1,
  "name": "Roasted Potatoes",
  "ingredients": ["Potato", "Onion", "Oil", "Salt"],
  "isSpecial": false
},
{
  "id": 2,
  "chefId": 1,
  "name": "Dauphinoise potatoes",
  "ingredients": ["Potato", "Garlic", "Cream", "Salt"],
  "isSpecial": true
},
{
  "id": 3,
  "chefId": 2,
  "name": "Boiled Potatoes",
  "ingredients": ["Potato", "Salt"],
  "isSpecial": true
},
{
  "id": 4,
  "chefId": 3,
  "name": "Mashed Potatoes",
  "ingredients": ["Potato", "Butter", "Milk"],
  "isSpecial": false
},
{
  "id": 5,
  "chefId": 4,
  "name": "Hash Browns",
  "ingredients": ["Potato", "Onion", "Egg"],
  "isSpecial": false
}
I will be doing a search where "Potatoes" is contained in the name field, like this:
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*Potatoes*"
      }
    }
  }
}
But I also want to add some extra criteria when returning documents:
If the ingredients contain Onion or Milk, return those documents. So documents with ids 1 and 4 will be returned. Note that this means we now have documents for chef ids 1 and 3.
Then, for documents where we don't already have another document with the same chef id, return those where the isSpecial flag is set to true. So only document 3 will be returned: document 2 wouldn't be returned, because we already have a document with chef id 1.
Is it possible to do this kind of chaining in Elasticsearch? I would like to be able to do this in a single query so that I can avoid adding logic to my (Java) code.
You can't have that sort of logic in one Elasticsearch query. You could write a tricky query with aggregations, a post_filter, and so on to get all the data you need in one request, and then transform it in your Java application.
But the best (and most maintainable) approach is to use two queries.
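For illustration, the first query could combine the name match with the ingredients filter in a bool query. This is only a sketch, assuming ingredients is mapped as a keyword field so the terms values match exactly:
{
  "query": {
    "bool": {
      "must": [
        { "wildcard": { "name": { "value": "*Potatoes*" } } },
        { "terms": { "ingredients": ["Onion", "Milk"] } }
      ]
    }
  }
}
After collecting the chefId values from those hits in your Java code (1 and 3 in this example), the second query fetches the isSpecial documents while excluding those chefs:
{
  "query": {
    "bool": {
      "must": [
        { "wildcard": { "name": { "value": "*Potatoes*" } } },
        { "term": { "isSpecial": true } }
      ],
      "must_not": [
        { "terms": { "chefId": [1, 3] } }
      ]
    }
  }
}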

Separate multiple events in logstash input into separate documents in elasticsearch index

Input in Logstash:
{
  "Teacher": {
    "Name": "Mary",
    "age": 20
  },
  "Student": [
    {
      "Name": "Tim",
      "age": 12
    },
    {
      "Name": "Eric",
      "age": 13
    }
  ]
}
I need to filter this input using Logstash to send three separate documents to Elasticsearch:
doc1:
{
  "Name": "Mary",
  "age": 20
}
doc2:
{
  "Name": "Tim",
  "age": 12
}
doc3:
{
  "Name": "Eric",
  "age": 13
}
I tried the split, mutate, and ruby filters but did not get the desired result. Could someone help me separate these into individual documents in the Elasticsearch index?
Since you want a separate event for 'Mary', use the clone filter to create two copies of the event. Delete the 'Student' array from one copy to be left with just 'Mary'.
On the second copy, using the split filter will give you separate events for 'Tim' and 'Eric'.
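A minimal pipeline sketch of that approach (field names taken from the sample input; this assumes the clone filter's default behavior of writing the clone name into the event's type field):
filter {
  # create one copy of the event; the copy arrives with type == "students"
  clone {
    clones => ["students"]
  }

  if [type] == "students" {
    # emit one event per element of the Student array
    split {
      field => "Student"
    }
    # promote the student's fields to the top level
    mutate {
      rename => { "[Student][Name]" => "Name" "[Student][age]" => "age" }
      remove_field => ["Student", "Teacher"]
    }
  } else {
    # original event: keep only the teacher
    mutate {
      rename => { "[Teacher][Name]" => "Name" "[Teacher][age]" => "age" }
      remove_field => ["Teacher", "Student"]
    }
  }
}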

Update part of document in Elasticsearch from Kafka

I have multiple Kafka Connectors and topics that all house different sources of data, yet all contain a reference to the same primary key (let's call it "id"). Can you update Elasticsearch using this same id?
For example, source 1 has the following schema
{
  "id": 123,
  "some_value": "yo",
  "details": {}
}
Source 2 has the following
{
  "id": 123,
  "reference": 1
},
{
  "id": 123,
  "reference": 2
}
Is there a way I can create my expected outcome within ES to mimic the following?
{
  "id": 123,
  "some_value": "yo",
  "details": [
    {
      "id": 123,
      "reference": 1
    },
    {
      "id": 123,
      "reference": 2
    }
  ]
}
I have tried using Kafka Connect's transforms with HoistField but have been unsuccessful.

Update a subdocument list object value in RethinkDB

{
  "id": 1,
  "subdocuments": [
    {
      "id": "A",
      "name": 1
    },
    {
      "id": "B",
      "name": 2
    },
    {
      "id": "C",
      "name": 3
    }
  ]
}
How do I update subdocument "A"'s "name" to a value of 2 in RethinkDB, in either JavaScript or Python?
If you can rely on the position of your "A" element, you can update it like this:
r.db("DB").table("TABLE").get(1)
.update({subdocuments:
r.row("subdocuments").changeAt(0, r.row("subdocuments").nth(0).merge({"name":2}))})
If you cannot rely on the position, you have to find it yourself:
r.db("DB").table("TABLE").get(1).do(function(doc){
return doc("subdocuments").offsetsOf(function(sub){return sub("id").match("A")}).nth(0)
.do(function(index){
return r.db("DB").table("TABLE").update({"subdocuments":
doc("subdocuments").changeAt(index, doc("subdocuments").nth(index).merge({"name":2})) })})
})
As an alternative, you can use the map function to iterate over the array elements and update the one that matches your condition:
r.db("DB").table("TABLE").get(1)
.update({
subdocuments: r.row("subdocuments").map(function(sub){
return r.branch(sub("id").eq("A"), sub.merge({name: 2}), sub)
})
})
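Since the question also asks about Python, the same map-based approach looks like this with the Python driver (a minimal sketch, assuming a local server and the classic rethinkdb driver API):
import rethinkdb as r
# newer driver versions instead use: from rethinkdb import RethinkDB; r = RethinkDB()

conn = r.connect("localhost", 28015)

r.db("DB").table("TABLE").get(1).update({
    # rebuild the array, replacing only the element whose id is "A"
    "subdocuments": r.row["subdocuments"].map(
        lambda sub: r.branch(sub["id"] == "A", sub.merge({"name": 2}), sub)
    )
}).run(conn)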

JMESPath current array index

In JMESPath with this query:
people[].{"index":#.index,"name":name, "state":state.name}
On this example data:
{
  "people": [
    {
      "name": "a",
      "state": {"name": "up"}
    },
    {
      "name": "b",
      "state": {"name": "down"}
    },
    {
      "name": "c",
      "state": {"name": "up"}
    }
  ]
}
I get:
[
  {
    "index": null,
    "name": "a",
    "state": "up"
  },
  {
    "index": null,
    "name": "b",
    "state": "down"
  },
  {
    "index": null,
    "name": "c",
    "state": "up"
  }
]
How do I get the index property to actually have the index of the array? I realize that #.index is not the correct syntax but have not been able to find a function that would return the index. Is there a way to include the current array index?
Use-case
Use JMESPath query syntax to extract the numeric index of the current array element, from a series of array elements.
Pitfalls
As of this writing (2019-03-22), this feature is not part of the standard JMESPath specification.
Workaround
This is possible when running JMESPath from within any of various programming languages, but it must be done outside of JMESPath itself.
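For example, with the Python jmespath package you can run the projection and then attach the index in the host language (a minimal sketch, using the sample data from the question):
import jmespath

data = {"people": [
    {"name": "a", "state": {"name": "up"}},
    {"name": "b", "state": {"name": "down"}},
    {"name": "c", "state": {"name": "up"}},
]}

# run the projection without the index...
results = jmespath.search('people[].{"name": name, "state": state.name}', data)

# ...then number the results outside of JMESPath
indexed = [dict(item, index=i) for i, item in enumerate(results)]
# [{'name': 'a', 'state': 'up', 'index': 0}, ...]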
This is not exactly the form you requested but I have a possible answer for you:
people[].{"name":name, "state":state.name} | merge({count: length(#)}, #[*])
This query gives this result:
{
  "0": {
    "name": "a",
    "state": "up"
  },
  "1": {
    "name": "b",
    "state": "down"
  },
  "2": {
    "name": "c",
    "state": "up"
  },
  "count": 3
}
So each attribute of this object has an index, except the last one, count, which just holds the number of indexed attributes. If you want to iterate over the object's attributes with a loop, for example, you can, because the count attribute tells you how many attributes there are to visit.
