Specify Elasticsearch aggregation fields when finding duplicates - elasticsearch

I am using the following ES query when looking for duplicates:
"aggs": {
"duplicates": {
"terms": {
"field": "phone",
"min_doc_count": 2,
"size": 99999,
"order": {
"_term": "asc"
}
},
"aggs": {
"_docs": {
"top_hits": {
"size": 99999
}
}
}
}
}
It works well: it returns the key (in this case the phone) and, inside each bucket, all the matching documents. The main problem is that _source contains every field, which is a lot in my case, and I want to return only the fields I need. Example of what is returned:
"duplicates": {
"1": {
"key": "1",
"doc_count": 2,
"_docs": {
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "local:company_id:1:sync",
"_type": "leads",
"_id": "23",
"_score": 1,
"_source": {
"id": 23,
"phone": 123456,
"areacode_id": 426,
"areacode_state_id": 2,
"firstName": "Brayan",
"lastName": "Rastelli",
"state": "", // .... and so on
I want to specify the fields that will be returned on the _source, is that possible?
Another problem is that I want to order the aggregation results by a specific field (id), but if I put any field name in place of _term I get an error.
Thank you!

In the example below, the documents with id 29 and 23 have the same phone, so they are duplicates. The search query returns only two fields, id and phone (you can change these to suit your needs), and sorts the top hits by id.
Here is a working example with index data, search query, and search result.
Index Data:
{
"id": 29,
"phone": 123456,
"areacode_id": 426,
"areacode_state_id": 2,
"firstName": "Brayan",
"lastName": "Rastelli",
"state": ""
}
{
"id": 23,
"phone": 123456,
"areacode_id": 426,
"areacode_state_id": 2,
"firstName": "Brayan",
"lastName": "Rastelli",
"state": ""
}
{
"id": 30,
"phone": 1235,
"areacode_id": 92,
"areacode_state_id": 10,
"firstName": "Mark",
"lastName": "Smith",
"state": ""
}
Search Query:
{
"size": 0,
"aggs": {
"duplicates": {
"terms": {
"field": "phone",
"min_doc_count": 2,
"size": 99999
},
"aggs": {
"_docs": {
"top_hits": {
"_source": {
"includes": [
"phone",
"id"
]
},
"sort": [
{
"id": {
"order": "asc"
}
}
]
}
}
}
}
}
}
Search Result:
"aggregations": {
"duplicates": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 123456,
"doc_count": 2,
"_docs": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "66896259",
"_type": "_doc",
"_id": "1",
"_score": null,
"_source": {
"phone": 123456,
"id": 23
},
"sort": [
23 // note this
]
},
{
"_index": "66896259",
"_type": "_doc",
"_id": "2",
"_score": null,
"_source": {
"phone": 123456,
"id": 29
},
"sort": [
29 // note this
]
}
]
}
}
}
]
}
}
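As a rough sketch of how the response above might be consumed on the client side, the following Python builds the same request body and flattens the buckets into a mapping from phone to the trimmed _source documents. The helper names build_duplicates_query and extract_duplicates are hypothetical and not part of any Elasticsearch client; the sample response is trimmed from the search result shown above.

```python
def build_duplicates_query(field, source_fields, sort_field, bucket_size=99999):
    """Build the terms + top_hits request body used in the answer above."""
    return {
        "size": 0,
        "aggs": {
            "duplicates": {
                "terms": {"field": field, "min_doc_count": 2, "size": bucket_size},
                "aggs": {
                    "_docs": {
                        "top_hits": {
                            # Only these fields are returned in each hit's _source.
                            "_source": {"includes": list(source_fields)},
                            "sort": [{sort_field: {"order": "asc"}}],
                        }
                    }
                },
            }
        },
    }


def extract_duplicates(response):
    """Map each duplicated key to the list of _source dicts in its bucket."""
    buckets = response["aggregations"]["duplicates"]["buckets"]
    return {
        bucket["key"]: [hit["_source"] for hit in bucket["_docs"]["hits"]["hits"]]
        for bucket in buckets
    }


# Response trimmed from the search result shown above.
sample_response = {
    "aggregations": {
        "duplicates": {
            "buckets": [
                {
                    "key": 123456,
                    "doc_count": 2,
                    "_docs": {
                        "hits": {
                            "hits": [
                                {"_id": "1", "_source": {"phone": 123456, "id": 23}},
                                {"_id": "2", "_source": {"phone": 123456, "id": 29}},
                            ]
                        }
                    },
                }
            ]
        }
    }
}

query = build_duplicates_query("phone", ["phone", "id"], "id")
duplicates = extract_duplicates(sample_response)
print(duplicates)
```

Note that top_hits returns only three hits per bucket by default, so a "size" inside top_hits is probably still needed when a phone can appear on more than three documents.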

Related

Elastic Search Filter by comparing same field from different documents

In my index, I have documents like this:
{
"name": "name",
"createdAt": 1.6117508295E12
}
{
"name": "name1",
"createdAt": 1.6117508296E12
}
{
"name": "name",
"createdAt": 1.6117508297E12
}
I want to write a query that compares the name field across documents and returns one result per unique name. The result should be like this:
{
"name": "name1",
"createdAt": 1.6117508296E12
}
{
"name": "name",
"createdAt": 1.6117508297E12
}
I am also using from and size in my Elasticsearch query.
I have tried using collapse, but that gives me fewer results than the requested size.
I am using Elasticsearch 7.15.2.
You can use a terms aggregation with a top_hits sub-aggregation (size 1, sorted by createdAt). Below is a working query on the sample data you provided.
{
"size": 0,
"aggs": {
"unique": {
"terms": {
"field": "name.keyword"
},
"aggs": {
"unique_names": {
"top_hits": {
"sort": [
{
"createdAt": {
"order": "asc"
}
}
],
"size": 1
}
}
}
}
}
}
And search result
"aggregations": {
"unique": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "name",
"doc_count": 2,
"unique_names": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "71625371",
"_id": "1",
"_score": null,
"_source": {
"name": "name",
"createdAt": 1.6117508295E12
},
"sort": [
1.61175083E12
]
}
]
}
}
},
{
"key": "name1",
"doc_count": 1,
"unique_names": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "71625371",
"_id": "2",
"_score": null,
"_source": {
"name": "name1",
"createdAt": 1.6117508296E12
},
"sort": [
1.61175083E12
]
}
]
}
}
}
]
}
}
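To make the effect of the sort order concrete, here is a small Python simulation of what terms plus top_hits with size 1 selects. It does not talk to Elasticsearch, and one_doc_per_key is a hypothetical helper. With "order": "asc" (as in the query above) the oldest document per name is kept; the expected output in the question keeps the newest one, which would presumably correspond to "order": "desc".

```python
def one_doc_per_key(docs, key_field, sort_field, newest=False):
    """Keep one document per key_field value, chosen by sort_field.

    newest=False mimics "order": "asc" with size 1 (smallest value wins);
    newest=True mimics "order": "desc" with size 1 (largest value wins).
    """
    chosen = {}
    for doc in docs:
        key = doc[key_field]
        current = chosen.get(key)
        if current is None:
            chosen[key] = doc
        elif newest and doc[sort_field] > current[sort_field]:
            chosen[key] = doc
        elif not newest and doc[sort_field] < current[sort_field]:
            chosen[key] = doc
    return list(chosen.values())


# Sample documents from the question.
docs = [
    {"name": "name", "createdAt": 1.6117508295e12},
    {"name": "name1", "createdAt": 1.6117508296e12},
    {"name": "name", "createdAt": 1.6117508297e12},
]

earliest = one_doc_per_key(docs, "name", "createdAt")
latest = one_doc_per_key(docs, "name", "createdAt", newest=True)
print(earliest)
print(latest)
```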

Deduplicate results in elasticsearch based on a field

I have an elasticsearch index (v6.8) that contains documents that may share a similar value for a field.
[
{
"siren": 123,
"owner": "A",
"price": 10
},
{
"siren": 123,
"owner": "B",
"price": 20
},
{
"siren": 456,
"owner": "A",
"price": 10
},
{
"siren": 456,
"owner": "C",
"price": 30
}
]
I would like to get all documents from owners A and B, deduplicated on the siren field. I don't care which of the duplicated documents is returned (from owner A or B). The result would be:
[
{
"siren": 123,
"owner": "A",
"price": 10
},
{
"siren": 456,
"owner": "A",
"price": 10
}
]
Also, I would like my aggregations to count documents deduplicated on the same field.
I have tried
{
"query": {
"bool": {
"must": [
[
{
"terms": {
"owner": [
"A",
"B"
]
}
}
]
]
}
},
"aggs": {
"by_price": {
"terms": {
"field": "price",
"size": 20
}
}
}
}
But this counts the "same" document multiple times.
You can use a terms aggregation on the siren field along with a top_hits aggregation:
{
"size":0,
"query": {
"bool": {
"must": [
{
"terms": {
"owner.keyword": [
"A",
"B"
]
}
}
]
}
},
"aggs": {
"by_price": {
"terms": {
"field": "siren",
"size": 20
},
"aggs": {
"top_sales_hits": {
"top_hits": {
"_source": {
"includes": [
"siren",
"owner",
"price"
]
},
"size": 1
}
}
}
}
}
}
Search Result will be
"aggregations": {
"by_price": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 123,
"doc_count": 2,
"top_sales_hits": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "66226467",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"owner": "A", // note this
"siren": 123,
"price": 10
}
}
]
}
}
},
{
"key": 456,
"doc_count": 1,
"top_sales_hits": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "66226467",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"owner": "A", // note this
"siren": 456,
"price": 10
}
}
]
}
}
}
]
}
}
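The answer above covers the deduplicated hits but not the deduplicated counts. One possible approach, which is an assumption on my part and not part of the original answer, is to add a cardinality sub-aggregation on siren under the price buckets. The sketch below only builds the request body as a Python dict; build_distinct_count_query and the aggregation name distinct_sirens are hypothetical.

```python
def build_distinct_count_query(owners, bucket_field, distinct_field, size=20):
    """Build a request that counts distinct values of distinct_field per bucket."""
    return {
        "size": 0,
        "query": {
            "bool": {"must": [{"terms": {"owner.keyword": list(owners)}}]}
        },
        "aggs": {
            "by_price": {
                "terms": {"field": bucket_field, "size": size},
                "aggs": {
                    # Number of distinct sirens in each price bucket,
                    # instead of the raw doc_count.
                    "distinct_sirens": {"cardinality": {"field": distinct_field}}
                },
            }
        },
    }


body = build_distinct_count_query(["A", "B"], "price", "siren")
print(body)
```

Two caveats: cardinality counts are approximate on large data sets, and a siren whose duplicates carry different prices (123 has prices 10 and 20 in the sample data) would still be counted once in each price bucket.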

How to make aggregations work for text fields

I am trying to write an Elasticsearch query to get unique locality towns. My locality_town_keyword field is of type keyword. When I run the search I get search hits, but the aggregation buckets are empty.
This is what my schema looks like:
"locality_town": {
"type": "text"
},
"locality_town_keyword": {
"type": "keyword"
},
My search query looks like the following:
{
"query":
{
"prefix" : { "locality_town" : "m" }
},
"size": "1",
"_source": {
"includes": [
"locality_town*"
]
},
"aggs": {
"loc": {
"terms": {
"field": "locality_town_keyoword",
"size": 5,
"order": {
"_count": "desc"
}
}
}
}
}
Here is the output it gives
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 799,
"max_score": 1.0,
"hits": [
{
"_index": "tenderindex_2",
"_type": "tender_2",
"_id": "290077",
"_score": 1.0,
"_source": {
"locality_town": "Manchester",
"locality_town_keyword": "Manchester"
}
}
]
},
"aggregations": {
"loc": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
}
}
}
This is what one document looks like:
{
"_index": "tenderindex_2",
"_type": "tender_2",
"_id": "290077",
"_version": 1,
"_seq_no": 39,
"_primary_term": 1,
"found": true,
"_source": {
"title": "Legal Services",
"buyers": "CENTRAL MANCHESTER UNIVERSITY HOSPITALS NHS FOUNDATION TRUST",
"postal_code": "M13 0JR",
"publish_date": "2015-03-03T15:48:45Z",
"status": "cancelled",
"start_date": "2017-03-03T00:00:00Z",
"endt_date": "2020-03-03T00:00:00Z",
"url": "https://www.temp.com",
"country": "England",
"description": "desc......",
"language": "en-GB",
"service": "OPEN_CONTRACTING",
"value": "0",
"value_currency": "GBP",
"winner": "",
"create_time": "2019-05-11T21:39:42Z",
"deadline_date": "1970-01-01T00:00:00Z",
"address": "Central Manchester University Hospitals NHS Foundation Trust Wilmslow Park",
"locality_town": "Manchester",
"locality_town_keyword": "Manchester",
"region": "North West",
"tender_type": "planning",
"cpv": "Health services ",
"strpublish_date": "2015-03-03T15:48:45Z",
"strstart_date": "2017-03-03T00:00:00Z",
"strend_date": "2020-03-03T00:00:00Z",
"strdeadline_date": "",
"winner_email": "",
"winner_address": "",
"winner_town": "",
"winner_postalcode": "",
"winner_phone": "",
"cpvs": "[\"Health services (85100000-0)\"]"
}
}
Looks like you have a typo in your aggregation query:
"aggs": {
"loc": {
"terms": {
"field": "locality_town_keyoword", <== here
"size": 5,
Try with locality_town_keyword instead!
Hope this helps!
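A terms aggregation on a field that is not in the mapping returns empty buckets rather than an error, as the output in the question shows, so this kind of typo is easy to miss. As a minimal sketch, a client-side check could compare the aggregation field names against the mapping before sending the query. unmapped_agg_fields is a hypothetical helper; it only looks at top-level mapping properties and does not resolve multi-fields such as name.keyword.

```python
def unmapped_agg_fields(properties, aggs):
    """Return aggregation field names that are missing from the mapping."""
    missing = []
    for agg_body in aggs.values():
        for agg_type, params in agg_body.items():
            if agg_type in ("aggs", "aggregations"):
                # Recurse into sub-aggregations.
                missing.extend(unmapped_agg_fields(properties, params))
            elif isinstance(params, dict) and "field" in params:
                if params["field"] not in properties:
                    missing.append(params["field"])
    return missing


# Mapping and aggregation taken from the question.
properties = {
    "locality_town": {"type": "text"},
    "locality_town_keyword": {"type": "keyword"},
}
aggs = {
    "loc": {
        "terms": {
            "field": "locality_town_keyoword",
            "size": 5,
            "order": {"_count": "desc"},
        }
    }
}

print(unmapped_agg_fields(properties, aggs))
```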

How can I fetch all distinct objects within a field over an Elasticsearch index?

I am trying to get all distinct objects from within a field over the whole index.
What I tried so far is:
POST http://es5server:9200/indexname/_search
Content-Type: application/json
{
"size": "1",
"aggs": {
"tags": {
"terms": {
"field": "tags"
}
}
}
}
Currently this returns the following for me (I set size to 1 to include a sample document):
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.0,
"hits": [
{
"_index": "indexname",
"_type": "news",
"_id": "51",
"_score": 1.0,
"_source": {
"localized": {
"de": {
"title": null,
"shorttext": null,
"text": null
},
"en": {
"title": "test new title",
"shorttext": "hello my name is mayur and this is testnews text",
"text": null
}
},
"type": "object",
"key": "testnews",
"path": "\/",
"tags": {
"38": {
"name": "I AM",
"parent": "0"
},
"45": {
"name": "ffddd",
"parent": "43"
},
"43": {
"name": "kkjjttdd",
"parent": "0"
}
}
}
}
]
},
"aggregations": {
"tags": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
}
}
}
The buckets are empty, is this because the tags field contains objects and not text?
How can I get ES to return all distinct objects within the tags fields of all documents?
It looks like you need nested objects for your solution. See this question answered here
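The buckets are most likely empty because tags is an object keyed by tag id: the object itself holds no values, only leaf fields such as tags.38.name. As a rough sketch of the nested approach, the documents could be reshaped so that tags becomes a list of objects before reindexing. Everything below is an assumption for illustration: the helper tags_to_nested, the added id field, the keyword types, and the aggregation names are not from the original thread.

```python
def tags_to_nested(tags):
    """Turn {"38": {"name": ..., "parent": ...}} into a list of tag objects."""
    return [{"id": tag_id, **fields} for tag_id, fields in tags.items()]


# The tags object from the sample document in the question.
tags = {
    "38": {"name": "I AM", "parent": "0"},
    "45": {"name": "ffddd", "parent": "43"},
    "43": {"name": "kkjjttdd", "parent": "0"},
}
nested_tags = tags_to_nested(tags)

# Hypothetical mapping for the reshaped field.
nested_mapping = {
    "tags": {
        "type": "nested",
        "properties": {
            "id": {"type": "keyword"},
            "name": {"type": "keyword"},
            "parent": {"type": "keyword"},
        },
    }
}

# Hypothetical aggregation listing distinct tag names across all documents.
distinct_tags_aggs = {
    "tags": {
        "nested": {"path": "tags"},
        "aggs": {"names": {"terms": {"field": "tags.name"}}},
    }
}

print(nested_tags)
```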

How to sort bucket result based on viewed_timestamp in ElasticSearch?

I am new to Elasticsearch. I want to find the 10 most recently visited unique doc_id values.
I have done a first aggregation on doc_id and added a sub-aggregation to sort each group and get a single result. Now I want to sort the buckets themselves.
I am not able to sort the bucket results by viewed_timestamp. How can I add an order to the first aggregation?
I have tried other solutions given on Stack Overflow, but they are not working for me. Can anyone help me solve this problem?
Query
{
"query": {
"constant_score": {
"filter": {
"term": { "username": "nil#gmail.com" }
}
}
},
"size":0,
"aggs":{
"title": {
"terms": {
"field": "doc_id",
"size":0
}
,
"aggs": {
"top": {
"top_hits": {
"sort": [
{
"viewed_timestamp": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
Bucket result:
{
"aggregations": {
"title": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [{
"key": "b003",
"doc_count": 3,
"top_tag_hits": {
"hits": {
"total": 3,
"max_score": null,
"hits": [{
"_index": "visitedData",
"_type": "userdoc",
"_id": "AVak51Sp",
"_score": null,
"_source": {
"viewed_timestamp": "20160819T152359",
"content_type": "bp",
"title": "Data print",
"doc_id": "BP003"
},
"sort": [
1471620239000
]
}]
}
}
}, {
"key": "bp004",
"doc_count": 3,
"top_tag_hits": {
"hits": {
"total": 3,
"max_score": null,
"hits": [{
"_index": "visitedData",
"_type": "userdoc",
"_id": "AVak513Y8G",
"_score": null,
"_source": {
"viewed_timestamp": "20160819T152401",
"content_type": "bp",
"title": "Application Print",
"doc_id": "BP004"
},
"sort": [
1471620241000
]
}]
}
}
}]
}
}
}
It is because your viewed_timestamp field is not of type date; it is a timestamp. You should change this field to a date format, such as:
"updateTime": "2017-01-12T21:28:49.562065"
If you're only trying to order by the timestamp, you could try using a max aggregation, like in this example:
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/search-aggregations-metrics-top-hits-aggregation.html#_field_collapse_example
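A minimal sketch of the max-aggregation idea, written as a Python dict for the request body: the terms buckets are ordered by a max sub-aggregation on viewed_timestamp, and top_hits still returns the most recent document in each bucket. The function name build_recent_docs_query and the aggregation name latest_view are hypothetical, and the sketch assumes viewed_timestamp is mapped as a date or numeric field.

```python
def build_recent_docs_query(username, size=10):
    """Build a request for the most recently viewed unique doc_id values."""
    return {
        "query": {
            "constant_score": {"filter": {"term": {"username": username}}}
        },
        "size": 0,
        "aggs": {
            "title": {
                "terms": {
                    "field": "doc_id",
                    "size": size,
                    # Order buckets by the newest view they contain.
                    "order": {"latest_view": "desc"},
                },
                "aggs": {
                    "latest_view": {"max": {"field": "viewed_timestamp"}},
                    "top": {
                        "top_hits": {
                            "sort": [{"viewed_timestamp": {"order": "desc"}}],
                            "size": 1,
                        }
                    },
                },
            }
        },
    }


body = build_recent_docs_query("nil#gmail.com")
print(body)
```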
