Search for documents with the same value in Elasticsearch

I have a schema that looks something like this:
{
"mappings": {
"entity": {
"properties": {
"a": {
"type": "text"
},
"b": {
"type": "text"
}
}
}
}
}
I want to find all values of b that belong to entities whose a value is shared by 2 or more entities.
Querying against:
[{"a": "a1", "b": "b1"},
{"a": "a1", "b": "b2"},
{"a": "a2", "b": "b3"}]
Should return b1 and b2.

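You can run a terms aggregation on the a field with a min_doc_count of 2, then add a top_hits sub-aggregation to retrieve the matching b values. Note that a terms aggregation needs an aggregatable field: with the text mapping above, you would have to map a as keyword (or add a keyword sub-field) or enable fielddata on it.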
{
"size": 0,
"aggs": {
"dups": {
"terms": {
"field": "a",
"min_doc_count": 2
},
"aggs": {
"b_hits": {
"top_hits": {
"_source": "b"
}
}
}
}
}
}
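To make the logic of that aggregation easier to follow, here is a minimal Python sketch that imitates it locally on the sample documents from the question. It does not call Elasticsearch; shared_b_values is a hypothetical helper name, and the grouping only mirrors what terms with min_doc_count does conceptually.

```python
from collections import defaultdict

# Sample documents from the question.
docs = [
    {"a": "a1", "b": "b1"},
    {"a": "a1", "b": "b2"},
    {"a": "a2", "b": "b3"},
]


def shared_b_values(documents, min_doc_count=2):
    """Group documents by `a` and keep the `b` values of sufficiently large groups."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["a"]].append(doc["b"])
    # Mirrors min_doc_count: drop groups with fewer documents than required.
    return {a: bs for a, bs in groups.items() if len(bs) >= min_doc_count}


result = shared_b_values(docs)
print(result)
```

Keep in mind that top_hits returns only 3 hits per bucket by default, so for larger groups you would likely need to raise its size parameter.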

Related

Nested array of objects aggregation in Elasticsearch

Documents in the Elasticsearch are indexed as such
Document 1
{
"task_completed": 10,
"tagged_object": [
{
"category": "cat",
"count": 10
},
{
"category": "cars",
"count": 20
}
]
}
Document 2
{
"task_completed": 50,
"tagged_object": [
{
"category": "cars",
"count": 100
},
{
"category": "dog",
"count": 5
}
]
}
As you can see, the value of the category key is dynamic. I want to perform an aggregation similar to a SQL GROUP BY on category and return the sum of count for each category.
In the above example, the aggregation should return
cat: 10,
cars: 120 and
dog: 5
I would like to know how to write this aggregation query in Elasticsearch, if it is possible. Thanks in advance.

You can achieve the required result using the nested, terms, and sum aggregations.
Here is a working example with the index mapping, search query, and search result.
Index Mapping:
{
"mappings": {
"properties": {
"tagged_object": {
"type": "nested"
}
}
}
}
Search Query:
{
"size": 0,
"aggs": {
"resellers": {
"nested": {
"path": "tagged_object"
},
"aggs": {
"books": {
"terms": {
"field": "tagged_object.category.keyword"
},
"aggs":{
"sum_of_count":{
"sum":{
"field":"tagged_object.count"
}
}
}
}
}
}
}
}
Search Result:
"aggregations": {
"resellers": {
"doc_count": 4,
"books": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cars",
"doc_count": 2,
"sum_of_count": {
"value": 120.0
}
},
{
"key": "cat",
"doc_count": 1,
"sum_of_count": {
"value": 10.0
}
},
{
"key": "dog",
"doc_count": 1,
"sum_of_count": {
"value": 5.0
}
}
]
}
}
}
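As a cross-check of the expected numbers, the following Python sketch reproduces the same group-by-and-sum over the two sample documents without touching Elasticsearch. The helper name sum_count_by_category is hypothetical and only illustrates what the nested, terms, and sum combination computes.

```python
from collections import Counter

# The two sample documents from the question.
docs = [
    {
        "task_completed": 10,
        "tagged_object": [
            {"category": "cat", "count": 10},
            {"category": "cars", "count": 20},
        ],
    },
    {
        "task_completed": 50,
        "tagged_object": [
            {"category": "cars", "count": 100},
            {"category": "dog", "count": 5},
        ],
    },
]


def sum_count_by_category(documents):
    """Sum `count` per `category` across every nested tagged_object entry."""
    totals = Counter()
    for doc in documents:
        # Each nested entry is treated as its own unit, like a nested document.
        for tagged in doc.get("tagged_object", []):
            totals[tagged["category"]] += tagged["count"]
    return dict(totals)


totals = sum_count_by_category(docs)
print(totals)
```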

Bucket sort in composite aggregation?

How can I do Bucket Sort in composite Aggregation?
I need to do Composite Aggregation with Bucket sort.
I have tried Sort with aggregation.
I have tried composite aggregation.

I think this question is a continuation of your previous question, so I have assumed the same use case.
You need to use the bucket sort aggregation, a parent pipeline aggregation that sorts the buckets of its parent multi-bucket aggregation. Please refer to the documentation on the composite aggregation to learn more.
Here is a working example with index data, mapping, search query, and search result.
Index Mapping:
{
"mappings":{
"properties":{
"user":{
"type":"keyword"
},
"date":{
"type":"date"
}
}
}
}
Index Data:
{
"date": "2015-01-01",
"user": "user1"
}
{
"date": "2014-01-01",
"user": "user2"
}
{
"date": "2015-01-11",
"user": "user3"
}
Search Query:
The size parameter can be set to define how many composite buckets
should be returned. Each composite bucket is considered as a single
bucket, so setting a size of 10 will return the first 10 composite
buckets created from the values source. The response contains the
values for each composite bucket in an array containing the values
extracted from each value source. Defaults to 10.
{
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"size": 3, <-- note this
"sources": [
{
"product": {
"terms": {
"field": "user"
}
}
}
]
},
"aggs": {
"mySort": {
"bucket_sort": {
"sort": [
{
"sort_user": {
"order": "desc"
}
}
]
}
},
"sort_user": {
"min": {
"field": "date"
}
}
}
}
}
}
Search Result:
"aggregations": {
"my_buckets": {
"after_key": {
"product": "user3"
},
"buckets": [
{
"key": {
"product": "user3"
},
"doc_count": 1,
"sort_user": {
"value": 1.4209344E12,
"value_as_string": "2015-01-11T00:00:00.000Z"
}
},
{
"key": {
"product": "user1"
},
"doc_count": 1,
"sort_user": {
"value": 1.4200704E12,
"value_as_string": "2015-01-01T00:00:00.000Z"
}
},
{
"key": {
"product": "user2"
},
"doc_count": 1,
"sort_user": {
"value": 1.3885344E12,
"value_as_string": "2014-01-01T00:00:00.000Z"
}
}
]
}
}
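The ordering in that result can be reproduced with a small Python sketch over the three sample documents. It assumes, as my understanding of the feature rather than something stated in the answer, that bucket_sort reorders only the page of buckets that composite returns; buckets_sorted_by_min_date is a hypothetical helper name.

```python
from datetime import date

# Index data from the answer.
docs = [
    {"date": "2015-01-01", "user": "user1"},
    {"date": "2014-01-01", "user": "user2"},
    {"date": "2015-01-11", "user": "user3"},
]


def buckets_sorted_by_min_date(documents, size=3):
    """Bucket by user, take the earliest date per bucket, sort descending."""
    earliest = {}
    for doc in documents:
        day = date.fromisoformat(doc["date"])
        user = doc["user"]
        if user not in earliest or day < earliest[user]:
            earliest[user] = day
    # Assumption: composite hands back buckets in key order, `size` at a time.
    page = sorted(earliest.items())[:size]
    # The bucket_sort step then reorders that page by the min date, descending.
    return sorted(page, key=lambda item: item[1], reverse=True)


ordered = [user for user, _ in buckets_sorted_by_min_date(docs)]
print(ordered)
```

If that assumption holds, a size smaller than the number of distinct users would give an ordering that is correct only within each page, not across the whole index.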

Elasticsearch aggregation over children document field values

I'm facing the following problem of selecting and sorting parent documents based on an aggregated value over its children documents. The aggregation (e.g. sum) itself depends on a query string, i.e. which children documents are relevant for the aggregation.
Example: Given the documents basket A and basket B, for each basket document, I am looking to sum over the number field of its fruit children if the name field matches my query, e.g. apples.
PUT /baskets/_doc/0
{
"name": "basket A",
"fruit": [
{
"name": "apples",
"number": 2
},
{
"name": "oranges",
"number": 3
}
]
}
PUT /baskets/_doc/1
{
"name": "basket B",
"fruit": [
{
"name": "apples",
"number": 3
},
{
"name": "apples",
"number": 3
}
]
}
Mappings:
PUT /baskets
{
"mappings": {
"properties": {
"name": { "type": "text" },
"fruit": {
"type": "nested",
"properties": {
"name": { "type": "text" },
"number": { "type": "long" }
}
}
}
}
}
Use case 1: Which basket has (strictly) more than 5 apples? Would expect only basket B
Use case 2: Sort baskets by number of apples. Would expect basket B with a total of 6 apples, then basket A with a total of 2 apples.
How can one implement this using the Elasticsearch (7.8.0) query DSL?
I have tried so far with nested queries and aggregations without success.
Thanks!
Edit: Added mappings
Edit: Updated the numbers to better reflect the problem
Edit: Added a possible answer to use case 2 (see the comments on @joe's answer):
GET /baskets/_search
{
"aggs": {
"aggs_baskets": {
"terms": {
"field": "name",
"order": {"nest > fruit_filter > fruit_sum": "desc"}
},
"aggs": {
"nest":{
"nested":{
"path": "fruit"
},
"aggs":{
"fruit_filter":{
"filter": {
"term": {"fruit.name": "apples"}
},
"aggs":{
"fruit_sum":{
"sum": {"field": "fruit.number"}
}
}
}
}
}
}
}
}
}

Use case 1:
GET baskets/_search
{
"query": {
"nested": {
"path": "fruit",
"inner_hits": {},
"query": {
"bool": {
"must": [
{
"term": {
"fruit.name": {
"value": "apples"
}
}
},
{
"range": {
"fruit.number": {
"gte": 5
}
}
}
]
}
}
}
}
}
Use gt for strictly more than 5 and gte for 5 or more; the query above uses gte, so swap in gt to match the use case exactly.
Also notice the inner_hits part: it returns the actual nested subdocument that caused a particular basket to match the query. It is not required, but it is good to know.
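One caveat worth illustrating: a nested query tests each nested fruit entry on its own, so the range clause compares a single entry's number rather than a per-basket total. The Python sketch below imitates that per-entry check on the question's sample baskets; match_single_entry is a hypothetical helper, not an Elasticsearch API.

```python
# Sample baskets from the question (after the numbers were updated).
baskets = [
    {
        "name": "basket A",
        "fruit": [
            {"name": "apples", "number": 2},
            {"name": "oranges", "number": 3},
        ],
    },
    {
        "name": "basket B",
        "fruit": [
            {"name": "apples", "number": 3},
            {"name": "apples", "number": 3},
        ],
    },
]


def match_single_entry(documents, fruit_name, minimum):
    """Baskets with at least one nested entry of that fruit with number >= minimum."""
    return [
        doc["name"]
        for doc in documents
        if any(
            entry["name"] == fruit_name and entry["number"] >= minimum
            for entry in doc["fruit"]
        )
    ]


per_entry = match_single_entry(baskets, "apples", 5)
print(per_entry)
```

With these numbers no single entry reaches 5, so basket B is not matched even though its apples add up to 6. Summing across entries needs an aggregation instead.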
Use case 2:
GET baskets/_search
{
"sort": [
{
"fruit.number": {
"nested_path": "fruit",
"order": "desc"
}
}
]
}
Use case 2 Edit:
There are probably cleaner ways of doing this but I'd go with the following:
GET baskets/_search
{
"size": 0,
"aggs": {
"multiply_and_add": {
"scripted_metric": {
"params": {
"only_fruit_name": "apples"
},
"init_script": "state.by_basket_name = [:]",
"map_script": """
def basket_name = params._source['name'];
def fruits = params._source['fruit'].findAll(group -> group.name == params.only_fruit_name);
for (def fruit_group : fruits) {
def number = fruit_group.number;
if (state.by_basket_name.containsKey(basket_name)) {
state.by_basket_name[basket_name] += number;
} else {
state.by_basket_name[basket_name] = number;
}
}
""",
"combine_script": "return state.by_basket_name",
"reduce_script": "return states"
}
}
}
}
yielding a hash map along the lines of
{
...
"aggregations":{
"multiply_and_add":{
"value":[
{
"basket A":2,
"basket B":6
}
]
}
}
}
Sorting can be done either in the reduce_script or in your post-processing of the ES response. You could of course also go with (sorted) lists and lambdas.
Notice the required nested_path in the sort of the first use case 2 query.
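The post-processing route mentioned for the scripted metric could look roughly like the Python sketch below. It works on a hard-coded copy of the response shown above and assumes the value list holds one map per shard (since reduce_script simply returns states); sort_baskets is a hypothetical helper name.

```python
# Copy of the aggregation part of the response shown above.
response = {
    "aggregations": {
        "multiply_and_add": {
            "value": [
                {"basket A": 2, "basket B": 6},
            ]
        }
    }
}


def sort_baskets(resp):
    """Merge the per-shard maps and rank baskets by their total, highest first."""
    merged = {}
    # Assumption: each list element is the map produced by one shard.
    for shard_map in resp["aggregations"]["multiply_and_add"]["value"]:
        for name, total in shard_map.items():
            merged[name] = merged.get(name, 0) + total
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


ranked = sort_baskets(response)
print(ranked)
```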

After a while of searching and testing, here are possible queries for both use cases (in addition to @joe's answer to use case 2). Note that both require changing the mapping of the name field to type keyword.
Use case 1: Which basket has (strictly) more than 5 apples? Would expect only basket B
For more information on filtering results by their aggregation value, see Bucket Selectors.
GET /baskets/_search
{
"aggs": {
"aggs_baskets": {
"terms": {
"field": "name"
},
"aggs": {
"nest":{
"nested":{
"path": "fruit"
},
"aggs":{
"fruit_filter":{
"filter": {
"match": {"fruit.name": "apples"}
},
"aggs":{
"fruit_sum":{
"sum": {"field": "fruit.number"}
}
}
}
}
},
"basket_sum_filter":{
"bucket_selector":{
"buckets_path":{
"fruitSum":"nest > fruit_filter > fruit_sum"
},
"script":"params.fruitSum > 5"
}
}
}
}
}
}
... yielding
...,
"buckets": [
{
"key": "basket B",
"doc_count": 1,
"nest": {
"doc_count": 2,
"fruit_filter": {
"doc_count": 2,
"fruit_sum": {
"value": 6
}
}
}
}
]
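For comparison, here is what the filter, sum, and bucket_selector chain computes, written as a plain Python sketch over the sample baskets. It is only an illustration of the logic; apple_totals is a hypothetical helper name.

```python
# Sample baskets from the question.
baskets = [
    {
        "name": "basket A",
        "fruit": [
            {"name": "apples", "number": 2},
            {"name": "oranges", "number": 3},
        ],
    },
    {
        "name": "basket B",
        "fruit": [
            {"name": "apples", "number": 3},
            {"name": "apples", "number": 3},
        ],
    },
]


def apple_totals(documents, fruit_name="apples"):
    """Sum `number` over the matching fruit entries of each basket."""
    return {
        doc["name"]: sum(
            entry["number"] for entry in doc["fruit"] if entry["name"] == fruit_name
        )
        for doc in documents
    }


totals = apple_totals(baskets)
# Same condition as the bucket_selector script: params.fruitSum > 5.
more_than_five = {name: total for name, total in totals.items() if total > 5}
print(more_than_five)
```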
Use case 2: Sort baskets by number of apples. Would expect basket B with a total of 6 apples, then basket A with a total of 2 apples.
GET /baskets/_search
{
"aggs": {
"aggs_baskets": {
"terms": {
"field": "name",
"order": {"nest > fruit_filter > fruit_sum": "desc"}
},
"aggs": {
"nest":{
"nested":{
"path": "fruit"
},
"aggs":{
"fruit_filter":{
"filter": {
"term": {"fruit.name": "apples"}
},
"aggs":{
"fruit_sum":{
"sum": {"field": "fruit.number"}
}
}
}
}
}
}
}
}
}
... yielding
...,
"buckets": [
{
"key": "basket B",
"doc_count": 1,
"nest": {
"doc_count": 2,
"fruit_filter": {
"doc_count": 2,
"fruit_sum": {
"value": 6
}
}
}
},
{
"key": "basket A",
"doc_count": 1,
"nest": {
"doc_count": 2,
"fruit_filter": {
"doc_count": 1,
"fruit_sum": {
"value": 2
}
}
}
}
]
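Because the sum sits three levels deep in each bucket, a small helper can flatten the response on the client side. The Python sketch below works on a copy of the buckets shown above; flatten is a hypothetical name, and the key path simply follows the aggregation names used in the query.

```python
# Buckets copied from the response shown above.
buckets = [
    {
        "key": "basket B",
        "doc_count": 1,
        "nest": {
            "doc_count": 2,
            "fruit_filter": {"doc_count": 2, "fruit_sum": {"value": 6}},
        },
    },
    {
        "key": "basket A",
        "doc_count": 1,
        "nest": {
            "doc_count": 2,
            "fruit_filter": {"doc_count": 1, "fruit_sum": {"value": 2}},
        },
    },
]


def flatten(bucket_list):
    """Return (basket name, apple total) pairs in the order they were received."""
    return [
        (bucket["key"], bucket["nest"]["fruit_filter"]["fruit_sum"]["value"])
        for bucket in bucket_list
    ]


rows = flatten(buckets)
print(rows)
```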

ElasticSearch aggregation query with List in documents

I have following records of car sales of different brands in different cities.
Document -1
{
"city": "Delhi",
"cars":[{
"name":"Toyota",
"purchase":100,
"sold":80
},{
"name":"Honda",
"purchase":200,
"sold":150
}]
}
Document -2
{
"city": "Delhi",
"cars":[{
"name":"Toyota",
"purchase":50,
"sold":40
},{
"name":"Honda",
"purchase":150,
"sold":120
}]
}
I am trying to come up with query to aggregate car statistics for a given city but not getting the right query.
Required result:
{
"city": "Delhi",
"cars":[{
"name":"Toyota",
"purchase":150,
"sold":120
},{
"name":"Honda",
"purchase":350,
"sold":270
}]
}

First, map your array as a nested field (a script would be complicated and would not perform well). Nested fields are indexed, so the aggregation will be fast.
Delete your index or create a new one. Please note that I use test as the type.
{
"mappings": {
"test": {
"properties": {
"city": {
"type": "keyword"
},
"cars": {
"type": "nested",
"properties": {
"name": {
"type": "keyword"
},
"purchase": {
"type": "integer"
},
"sold": {
"type": "integer"
}
}
}
}
}
}
}
Index your documents (the same way you did before).
For the aggregation:
{
"size": 0,
"aggs": {
"avg_grade": {
"terms": {
"field": "city"
},
"aggs": {
"resellers": {
"nested": {
"path": "cars"
},
"aggs": {
"agg_name": {
"terms": {
"field": "cars.name"
},
"aggs": {
"avg_pur": {
"sum": {
"field": "cars.purchase"
}
},
"avg_sold": {
"sum": {
"field": "cars.sold"
}
}
}
}
}
}
}
}
}
}
result:
"buckets": [
{
"key": "Honda",
"doc_count": 2,
"avg_pur": {
"value": 350
},
"avg_sold": {
"value": 270
}
},
{
"key": "Toyota",
"doc_count": 2,
"avg_pur": {
"value": 150
},
"avg_sold": {
"value": 120
}
}
]
If you have indexed the name or city field as text (first ask yourself whether that is necessary), use the .keyword sub-field in the terms aggregation ("cars.name.keyword").
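To verify the totals in the required result, the following Python sketch performs the same city, then car name, then sum roll-up on the two sample documents. It runs locally and is not the Elasticsearch query itself; car_stats_by_city is a hypothetical helper name.

```python
from collections import defaultdict

# The two sample documents from the question.
docs = [
    {
        "city": "Delhi",
        "cars": [
            {"name": "Toyota", "purchase": 100, "sold": 80},
            {"name": "Honda", "purchase": 200, "sold": 150},
        ],
    },
    {
        "city": "Delhi",
        "cars": [
            {"name": "Toyota", "purchase": 50, "sold": 40},
            {"name": "Honda", "purchase": 150, "sold": 120},
        ],
    },
]


def car_stats_by_city(documents):
    """Sum purchase and sold per car name within each city."""
    stats = defaultdict(lambda: defaultdict(lambda: {"purchase": 0, "sold": 0}))
    for doc in documents:
        for car in doc["cars"]:
            entry = stats[doc["city"]][car["name"]]
            entry["purchase"] += car["purchase"]
            entry["sold"] += car["sold"]
    # Convert the nested defaultdicts into plain dicts for easier printing.
    return {city: dict(cars) for city, cars in stats.items()}


stats = car_stats_by_city(docs)
print(stats)
```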

Elasticsearch aggregation by field name

Imagine two documents:
[
{
"_id": "abc",
"categories": {
"category-id-1": 1,
"category-id-2": 50
}
},
{
"_id": "def",
"categories": {
"category-id-1": 2
}
}
]
As you can see, each document can be associated with a number of categories, by setting a nested field into the categories field.
With this mapping, I should be able to request the documents of a given category and order them by the value stored in that category's field.
My problem is that I now want an aggregation that counts the number of documents for each category. That would give the following result for the dataset I provided:
{
"aggregations": {
"categories" : {
"buckets": [
{
"key": "category-id-1",
"doc_count": 2
},
{
"key": "category-id-2",
"doc_count": 1
}
]
}
}
}
I can't find anything in the documentation to solve this problem. I'm completely new to ElasticSearch so I may be doing something wrong either on my documentation research or on my mapping choice.
Is it possible to make this kind of aggregation with my mapping? I'm using ES 6.x
EDIT: Here is the mapping for the index:
{
"test1234": {
"mappings": {
"_doc": {
"properties": {
"categories": {
"properties": {
"category-id-1": {
"type": "long"
},
"category-id-2": {
"type": "long"
}
}
}
}
}
}
}
}

The most straightforward solution is to add a new field that contains all the distinct categories of a document.
If we call this field categories_list, a solution could look like this:
Change the mapping to:
{
"test1234": {
"mappings": {
"_doc": {
"properties": {
"categories": {
"properties": {
"category-id-1": {
"type": "long"
},
"category-id-2": {
"type": "long"
}
}
},
"categories_list": {
"type": "keyword"
}
}
}
}
}
}
Then you need to modify your documents like this:
[
{
"_id": "abc",
"categories": {
"category-id-1": 1,
"category-id-2": 50
},
"categories_list": ["category-id-1", "category-id-2"]
},
{
"_id": "def",
"categories": {
"category-id-1": 2
},
"categories_list": ["category-id-1"]
}
]
Then your aggregation request should be:
{
"aggs": {
"categories": {
"terms": {
"field": "categories_list",
"size": 10
}
}
}
}
and will return
"aggregations": {
"categories": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "category-id-1",
"doc_count": 2
},
{
"key": "category-id-2",
"doc_count": 1
}
]
}
}
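Since categories_list has to be filled in before indexing, a small preprocessing step can derive it from the keys of categories. The Python sketch below shows one way to do that and checks the expected counts locally; with_categories_list is a hypothetical helper, and the actual indexing call is left out.

```python
from collections import Counter

# The two sample documents from the question.
docs = [
    {"_id": "abc", "categories": {"category-id-1": 1, "category-id-2": 50}},
    {"_id": "def", "categories": {"category-id-1": 2}},
]


def with_categories_list(doc):
    """Return a copy of the document with categories_list built from its category keys."""
    enriched = dict(doc)
    enriched["categories_list"] = sorted(doc.get("categories", {}))
    return enriched


enriched_docs = [with_categories_list(doc) for doc in docs]

# Local equivalent of the terms aggregation on categories_list.
counts = Counter(
    category for doc in enriched_docs for category in doc["categories_list"]
)
print(dict(counts))
```

Deriving the list in code keeps it in sync with categories, which would otherwise have to be maintained by hand on every update.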
