How to use composite aggregations with elasticsearch-dsl - elasticsearch

I am using aggregations, but an aggregation bucket accepts only a single key by default. After some research I found this:
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "category_pk": { "terms": { "field": "category.pk"} } },
{ "category_name": { "terms": {"field": "category.name" } } }
]
}
}
}
}
The query above returns buckets with two keys plus a doc_count, but I can't reproduce it with elasticsearch-dsl.
Can someone help me? Thanks.

I solved the problem. When using a composite aggregation with elasticsearch-dsl:
from elasticsearch_dsl import A

s = ProductDocument.search()
brand_name = A('terms', field='brand.name')
brand_pk = A('terms', field='brand.id')
brand_key_aggs = [
    {'brand_pk': brand_pk},
    {'brand_name': brand_name},
]
s.aggs.bucket('brand_terms', 'composite', sources=brand_key_aggs)
Example result:
{
  'key': {
    'brand_pk': 869,
    'brand_name': 'Uni Baby'
  },
  'doc_count': 2
}
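To read the buckets back (and page through them, since composite aggregations are paginated), something like the sketch below works. This is a minimal sketch, not part of the original solution: it reuses the ProductDocument class from above, and the size value and after_key pagination follow the standard composite aggregation API.

from elasticsearch_dsl import A

s = ProductDocument.search()
s = s.extra(size=0)  # we only want the aggregation, not the hits
sources = [
    {'brand_pk': A('terms', field='brand.id')},
    {'brand_name': A('terms', field='brand.name')},
]
s.aggs.bucket('brand_terms', 'composite', sources=sources, size=100)
response = s.execute()
for bucket in response.aggregations.brand_terms.buckets:
    print(bucket.key.brand_pk, bucket.key.brand_name, bucket.doc_count)
# On recent ES versions the response also carries after_key; pass it back
# via after=... in the bucket() call above to request the next page.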

Related

ElasticSearch 6.7 painless, how to access nested document

When I upgraded from ES 5.5 to 6.7, my Painless script stopped working.
In 5.5, if I wanted to get a nested field [carFilter] I did this:
params['_source']['carFilter']
and it worked very well.
But when I use version 6.7, the same expression
params['_source']['carFilter']
no longer works: params['_source'] is always null.
My mapping:
carFilter": {
"type": "nested",
"properties": {
"time": {
"type": "long"
}
}
}
My data example:
"carFilter": [
  { "time": 20200120 },
  { "time": 20200121 }
]
And my query script example:
{
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": {
              "inline": """
                if (params['_source'] != null) {
                  if (params['_source']['carFilter'] != null) {
                    for (def item : params['_source']['carFilter']) {
                      if (item.time > 1) { return true; }
                    }
                  }
                }
                return false;
              """,
              "lang": "painless",
              "params": {
                "rentTime": 1000
              }
            }
          }
        }
      ]
    }
  }
}
There is no error, but in fact the check on this line
if (params['_source'] != null) {
already fails and the script returns false, because params['_source'] is null.
The simple Painless above just illustrates the problem; a more realistic script is attached below.
double carPrice = 0.00;
if (!params['_source'].empty) {
  def days = params['_source']['everyDayPrice'];
  if (params['_source']['everyDayPrice'] != null) {
    int size = days.length;
    if (size > 0) {
      for (int i = 0; i < size; i++) {
        String day = days[i]['day'];
        Double price = days[i]['price'];
        if (price != null && params.get(day) != null) {
          carPrice = carPrice + params.get(day) * price;
        }
      }
    }
  }
}
return carPrice / params.total
Looking at your query, you want to filter the documents having carFilter.time > 1, so why not use a simple nested query:
POST <your_index_name>/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "carFilter",
            "query": {
              "range": {
                "carFilter.time": {
                  "gt": 1
                }
              }
            }
          }
        }
      ]
    }
  }
}
Note that I've made use of Range Query to evaluate the time based on what you are looking for.
I'd suggest you go through this answer if the above doesn't help.
Let me know if you have any queries.
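For completeness, here is a minimal sketch of the same nested query issued through the official Python client; the client address and the index name cars are assumptions for illustration:

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

# Nested query: matches parent documents having at least one carFilter
# object whose time is greater than 1.
body = {
    "query": {
        "bool": {
            "must": [
                {
                    "nested": {
                        "path": "carFilter",
                        "query": {"range": {"carFilter.time": {"gt": 1}}}
                    }
                }
            ]
        }
    }
}

response = es.search(index='cars', body=body)  # index name is a placeholder
print(response['hits']['hits'])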

Elasticsearch partial update based on Aggregation result

I want to partially update all documents based on an aggregation result.
Here is my object:
{
  "name": "name",
  "identificationHash": "aslkdakldjka",
  "isDupe": false,
  ...
}
My goal is to set isDupe to true for all documents whose "identificationHash" occurs at least twice.
Currently what I'm doing is:
I fetch all documents where "isDupe" is false, with a terms aggregation on "identificationHash" and a min_doc_count of 2:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "isDupe": {
              "value": false,
              "boost": 1
            }
          }
        }
      ]
    }
  },
  "aggregations": {
    "identificationHashCount": {
      "terms": {
        "field": "identificationHash",
        "size": 10000,
        "min_doc_count": 2
      }
    }
  }
}
With the aggregation result, I do a bulk update with a script setting ctx._source.isDupe = true for every identificationHash that matches my aggregation result.
I repeat steps 1 and 2 until the aggregation query returns no more results.
My question is: is there a better solution to this problem? Can I do the same thing with a single script query, without looping over batches of 1000 identification hashes?
There's no solution that I know of that allows you to do this in one shot. However, there's a way to do it in two steps, without having to iterate over several batches of hashes.
The idea is to first identify all the hashes to be updated using a feature called Transforms, which leverages aggregations to build a new index out of the aggregation results.
Once that new index has been created by your transform, you can use it as a terms lookup to run an update by query that sets the isDupe boolean on all documents having a matching hash.
So, first, we create a transform that builds a new index whose documents contain all duplicate hashes that need to be updated. This is achieved using a scripted_metric aggregation whose job is to identify all hashes occurring at least twice and for which isDupe: false. We also aggregate by week, so for each week there will be one document containing all duplicate hashes for that week.
PUT _transform/dup-transform
{
  "source": {
    "index": "test-index",
    "query": {
      "term": {
        "isDupe": "false"
      }
    }
  },
  "dest": {
    "index": "test-dups",
    "pipeline": "set-id"
  },
  "pivot": {
    "group_by": {
      "week": {
        "date_histogram": {
          "field": "lastModifiedDate",
          "calendar_interval": "week"
        }
      }
    },
    "aggregations": {
      "dups": {
        "scripted_metric": {
          "init_script": """
            state.week = -1;
            state.hashes = [:];
          """,
          "map_script": """
            // gather all hashes from each shard and count them
            def hash = doc['identificationHash.keyword'].value;
            // set week
            state.week = doc['lastModifiedDate'].value.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR).toString();
            // initialize hashes
            if (!state.hashes.containsKey(hash)) {
              state.hashes[hash] = 0;
            }
            // increment hash
            state.hashes[hash] += 1;
          """,
          "combine_script": "return state",
          "reduce_script": """
            def hashes = [:];
            def week = -1;
            // group the hash counts from each shard and add them up
            for (state in states) {
              if (state == null) return null;
              week = state.week;
              for (hash in state.hashes.keySet()) {
                if (!hashes.containsKey(hash)) {
                  hashes[hash] = 0;
                }
                hashes[hash] += state.hashes[hash];
              }
            }
            // only return the hashes occurring at least twice
            return [
              'week': week,
              'hashes': hashes.keySet().stream().filter(hash -> hashes[hash] >= 2)
                .collect(Collectors.toList())
            ]
          """
        }
      }
    }
  }
}
Before running the transform, we need to create the set-id pipeline (referenced in the dest section of the transform) that will define the ID of the target document that is going to contain the hashes so that we can reference it in the terms query for updating documents:
PUT _ingest/pipeline/set-id
{
  "processors": [
    {
      "set": {
        "field": "_id",
        "value": "{{dups.week}}"
      }
    }
  ]
}
We're now ready to start the transform to generate the list of hashes to update and it's as simple as running this:
POST _transform/dup-transform/_start
When it has run, the destination index test-dups will contain one document that looks like this:
{
  "_index" : "test-dups",
  "_type" : "_doc",
  "_id" : "44",
  "_score" : 1.0,
  "_source" : {
    "week" : "2021-11-01T00:00:00.000Z",
    "dups" : {
      "week" : "44",
      "hashes" : [
        "12345"
      ]
    }
  }
}
Finally, we can run the update by query as follows (add as many terms queries as weekly documents in the target index):
POST test/_update_by_query
{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        {
          "terms": {
            "identificationHash": {
              "index": "test-dups",
              "id": "44",
              "path": "dups.hashes"
            }
          }
        },
        {
          "terms": {
            "identificationHash": {
              "index": "test-dups",
              "id": "45",
              "path": "dups.hashes"
            }
          }
        }
      ]
    }
  },
  "script": {
    "source": "ctx._source.isDupe = true;"
  }
}
That's it in two simple steps!! Try it out and let me know.
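If you have many weekly documents in test-dups, hand-writing one terms lookup per week gets tedious. Here is a hedged sketch (not part of the original answer) that collects the IDs from the transform's destination index with the official Python client and assembles the update-by-query body dynamically; the client address and the cap of 10000 weekly documents are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

# Fetch the IDs of all weekly duplicate documents produced by the transform.
dups = es.search(index='test-dups', body={'size': 10000, '_source': False})
should = [
    {
        'terms': {
            'identificationHash': {
                'index': 'test-dups',
                'id': hit['_id'],
                'path': 'dups.hashes',
            }
        }
    }
    for hit in dups['hits']['hits']
]

# One terms lookup per weekly document, combined exactly like the manual query.
es.update_by_query(
    index='test',
    body={
        'query': {'bool': {'minimum_should_match': 1, 'should': should}},
        'script': {'source': 'ctx._source.isDupe = true;'},
    },
)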

Search for documents by exact values of different fields

I'm adding documents with the following structure:
{
  "proposta": {
    "matriculaIndicacao": 654321,
    "filial": 100,
    "cpf": "12345678901",
    "idStatus": "3",
    "status": "Reprovada",
    "dadosPessoais": {
      "nome": "John Five",
      "dataNascimento": "1980-12-01",
      "email": "fulanodasilva#fulano.com.br",
      "emailValidado": true,
      "telefoneCelular": "11 99876-9999",
      "telefoneCelularValidado": true,
      "telefoneResidencial": "11 2211-1122",
      "idGenero": "1",
      "genero": "M"
    }
  }
}
I'm trying to perform a search with multiple field values.
I can successfully search for a document with a specific cpf attribute using the following query:
{
  "query": {
    "term": {
      "proposta.cpf": "23798770823"
    }
  }
}
But now I need to add an AND clause, like
{
  "query": {
    "term": {
      "proposta.cpf": "23798770823",
      "proposta.dadosPessoais.dataNascimento": "1980-12-01"
    }
  }
}
but it returns an error message.
P.S.: If possible, I would like the search to return documents matching only the proposta.cpf field when the other field doesn't exist.
I really appreciate any help.
The idea is to combine your constraints within a bool/should query
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "proposta.cpf": "23798770823"
          }
        },
        {
          "term": {
            "proposta.dadosPessoais.dataNascimento": "1980-12-01"
          }
        }
      ]
    }
  }
}
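If the cpf match must be mandatory while the birth date stays optional (as the P.S. suggests), putting the cpf in must and the date in should achieves that. Below is a minimal sketch through the official Python client; the client address and the index name propostas are assumptions for illustration:

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

# cpf is required (must); dataNascimento only ranks matching docs higher
# (should), so documents missing that field are still returned.
body = {
    "query": {
        "bool": {
            "must": [
                {"term": {"proposta.cpf": "23798770823"}}
            ],
            "should": [
                {"term": {"proposta.dadosPessoais.dataNascimento": "1980-12-01"}}
            ]
        }
    }
}

response = es.search(index='propostas', body=body)  # index name is a placeholder
print(response['hits']['hits'])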

Return distinct values in Elasticsearch

I am trying to solve an issue where I have to get distinct results in a search.
{
  "name" : "ABC",
  "favorite_cars" : [ "ferrari", "toyota" ]
}, {
  "name" : "ABC",
  "favorite_cars" : [ "ferrari", "toyota" ]
}, {
  "name" : "GEORGE",
  "favorite_cars" : [ "honda", "Hyundae" ]
}
When I perform a term query on favorite_cars for "ferrari", I get two results whose name is ABC. I want only one result returned in this case, so my requirement is to apply a distinct on the name field and receive a single result.
Thanks
One way to achieve what you want is to use a terms aggregation on the name field and then a top_hits sub-aggregation with size 1, like this:
{
  "size": 0,
  "query": {
    "term": {
      "favorite_cars": "ferrari"
    }
  },
  "aggs": {
    "names": {
      "terms": {
        "field": "name"
      },
      "aggs": {
        "single_result": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
That way, you'll get a single term ABC and, nested into it, a single matching document.
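Reading the de-duplicated documents out of the aggregation response looks like the sketch below; this is a hedged illustration with the official Python client, where the client address and the index name people are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

body = {
    "size": 0,
    "query": {"term": {"favorite_cars": "ferrari"}},
    "aggs": {
        "names": {
            "terms": {"field": "name"},
            "aggs": {"single_result": {"top_hits": {"size": 1}}}
        }
    }
}

resp = es.search(index='people', body=body)  # index name is a placeholder
# One bucket per distinct name; each bucket carries its single top hit.
for bucket in resp['aggregations']['names']['buckets']:
    doc = bucket['single_result']['hits']['hits'][0]['_source']
    print(bucket['key'], doc)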

Elasticsearch aggregation with range query

I am working on building an ES query that satisfies the condition price >= avg(price).
Here is an example:
GET /_search
{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "price": {
            "gte": {
              "aggs": {
                "single_avg_price": {
                  "avg": {
                    "field": "price"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
I get the following error:
"type": "query_parsing_exception",
"reason": "[range] query does not support [aggs]"
I wonder how to use an aggregated value in a range query in Elasticsearch.
You cannot embed aggregations inside a query. You need to first send an aggregation query to find out the average and then send a second range query using the obtained average value.
Query 1:
POST /_search
{
  "size": 0,
  "aggs": {
    "single_avg_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}
Then you get the average price, say it was 12.3, and use it in your second query, like this:
Query 2:
POST /_search
{
  "size": 10,
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "price": {
            "gte": 12.3
          }
        }
      }
    }
  }
}
After trying different ES aggregations such as bucket_selector, I found that it can be done using Python.
Here is the Python code I created to solve this issue.
Please note: URL, USER_NAME, and PASSWORD need to be filled in before running it.
#! /usr/bin/python
import json
import requests
from requests.auth import HTTPBasicAuth

# static variables: URL should point at your index's _search endpoint
URL = ''
USER_NAME = ''
PASSWORD = ''

# recent Elasticsearch versions reject JSON bodies without this header
HEADERS = {'Content-Type': 'application/json'}

# returns the average price
def get_avg():
    query = json.dumps({
        "size": 0,  # we only need the aggregation, not the hits
        "aggs": {
            "single_avg_price": {
                "avg": {"field": "price"}
            }
        }
    })
    response = requests.get(URL, auth=HTTPBasicAuth(USER_NAME, PASSWORD),
                            headers=HEADERS, data=query)
    results = json.loads(response.text)
    return results['aggregations']['single_avg_price']['value']

# returns the documents whose price is greater than or equal to the average
def rows_greater_than_avg(avg_value):
    query = json.dumps({
        "query": {
            "range": {
                "price": {"gte": avg_value}
            }
        }
    })
    response = requests.get(URL, auth=HTTPBasicAuth(USER_NAME, PASSWORD),
                            headers=HEADERS, data=query)
    return json.loads(response.text)

# main method
def main():
    avg_value = get_avg()
    print(rows_greater_than_avg(avg_value))

main()
