Compare IDs between two indices in Elasticsearch

I have two indices in an Elasticsearch cluster, containing what ought to be the same data in two slightly different formats. However, the number of records differs. The IDs of each document should be the same. Is there a way to extract a list of the IDs that are present in one index but not the other?

If the documents are stored under the same type in both indices, you can use something like this:
GET index1,index2/_search
{
  "size": 0,
  "aggs": {
    "group_by_uid": {
      "terms": {
        "field": "_uid"
      },
      "aggs": {
        "count_indices": {
          "cardinality": {
            "field": "_index"
          }
        },
        "values_bucket_filter_by_index_count": {
          "bucket_selector": {
            "buckets_path": {
              "count": "count_indices"
            },
            "script": "params.count < 2"
          }
        }
      }
    }
  }
}
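The query above works in 5.x. Note that a terms aggregation returns only the top 10 buckets by default, so set its size high enough to cover the IDs you want to check. If your ID is also stored as a field inside the document, it is even easier to aggregate on that field instead.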
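Once the query has run, the buckets that survive the bucket_selector are the IDs found in only one index. A minimal Python sketch of reading them out, assuming the response has already been parsed into a dict; the function name and the sample response (including the keys "doc#17" and "doc#42") are hand-written for illustration and are not real output:

```python
def ids_in_only_one_index(response, agg_name="group_by_uid"):
    """Return the bucket keys left after the bucket_selector filter."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Hand-written sample shaped like a terms aggregation response (assumption).
sample_response = {
    "aggregations": {
        "group_by_uid": {
            "buckets": [
                {"key": "doc#17", "doc_count": 1, "count_indices": {"value": 1}},
                {"key": "doc#42", "doc_count": 1, "count_indices": {"value": 1}},
            ]
        }
    }
}

print(ids_in_only_one_index(sample_response))
```

The buckets themselves do not say which index holds each ID, so a follow-up lookup per ID may still be needed.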

For anyone who comes across this, Scrutineer (https://github.com/Aconex/scrutineer/) provides this sort of ability if you follow the convention of ID and version concepts within Elasticsearch.

Related

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) where I append a document every time the file is downloaded by a client. Each document is quite basic: it contains a field filename and a date field when that records the time of the download.
What I want to achieve is to get, for each file, the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword",
        "size": 1000
      }
    }
  },
  "size": 0
}
Now, I want to have a paginated result. The terms aggregation cannot be paginated, so I use a composite aggregation. Of course, if there is a better aggregation, it can be used here.
So for the moment, I have something like this:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "size": 0
}
This aggregation allows me to paginate (thanks to the after_key value in the response), but it is not sorted by the number of downloads; it is sorted by the filename.
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
The composite aggregation doesn't allow sorting by a computed value such as the document count.
Excerpt from the discussion on the Elastic forum:
"it's designed as a memory-friendly way to paginate over aggregations. Part of the tradeoff is that you lose things like ordering by doc count, since that isn't known until after all the docs have been collected."
I have no experience with Transforms (part of X-Pack, and licensed), but you can try them out. Apart from that, I don't see a way to get the expected output from the aggregation itself.
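One client-side workaround, if the number of distinct filenames fits in memory, is to walk every composite page and sort the collected buckets afterwards. This is only a sketch: fetch_page is a hypothetical caller-supplied function that runs the composite query with a given "after" key, and fake_fetch below is a stub with made-up data standing in for the cluster.

```python
def collect_and_sort(fetch_page, agg_name="downloads_agg"):
    """Walk all composite pages, then sort buckets by doc_count, descending.

    fetch_page(after) is assumed to run the composite query with the given
    "after" key (None for the first page) and return the parsed response.
    """
    buckets = []
    after = None
    while True:
        agg = fetch_page(after)["aggregations"][agg_name]
        buckets.extend(agg["buckets"])
        after = agg.get("after_key")
        # Stop when a page is empty or no further key is offered.
        if after is None or not agg["buckets"]:
            break
    return sorted(buckets, key=lambda b: b["doc_count"], reverse=True)


# Stub pages with made-up data, keyed by the last filename of the previous page.
_PAGES = {
    None: {
        "buckets": [
            {"key": {"downloads": "a.txt"}, "doc_count": 3},
            {"key": {"downloads": "b.txt"}, "doc_count": 7},
        ],
        "after_key": {"downloads": "b.txt"},
    },
    "b.txt": {
        "buckets": [{"key": {"downloads": "c.txt"}, "doc_count": 5}],
        "after_key": {"downloads": "c.txt"},
    },
    "c.txt": {"buckets": []},
}


def fake_fetch(after):
    key = after["downloads"] if after else None
    return {"aggregations": {"downloads_agg": _PAGES[key]}}


ranked = collect_and_sort(fake_fetch)
print([(b["key"]["downloads"], b["doc_count"]) for b in ranked])
```

This gives up the memory-friendliness that the composite aggregation is designed for, because every bucket is held on the client before sorting.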

Pagination with specific search type on ElasticSearch

We are currently using Elasticsearch 6.7 and have a huge amount of data, which makes some requests take too long.
To avoid this problem, we want to set up pagination in our searches against Elasticsearch. The problem is that I can't apply any of the pagination methods proposed by ES to the requests that already exist.
For example, this request contains different aggregations and a query:
https://github.com/trackit/trackit/blob/master/usageReports/lambda/es_request_constructor.go#L61-L75
In addition, the results are sorted after the information is collected.
I tried to set up the Search After method as well as a form of pagination using from & size.
Scroll doesn't work with aggregations, and the composite aggregation doesn't accept a query.
So, is there a good way to do pagination in Elasticsearch combined with other request types, and how would it be done with the example above?
"composite aggregation doesn't accept query"
It does accept a query. In the example below, the results are filtered on play_name. The aggregation is applied only to the results of the query, and it can be paginated using the after option.
{
  "query": {
    "term": {
      "play_name": "A Winters Tale"
    }
  },
  "size": 0,
  "aggs": {
    "speaker": {
      "composite": {
        "after": {
          "product": "FLORIZEL"
        },
        "sources": [
          {
            "product": {
              "terms": {
                "field": "speaker"
              }
            }
          }
        ]
      },
      "aggs": {
        "speech_number": {
          "terms": {
            "field": "speech_number"
          },
          "aggs": {
            "line_id": {
              "terms": {
                "field": "line_id"
              }
            }
          }
        }
      }
    }
  }
}
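To request successive pages, the same body can be rebuilt with a different after value each time. A minimal Python sketch of such a builder, assuming the field names from the example above; the function name build_request and the page_size default are illustrative, and the sub-aggregations are left out for brevity:

```python
import json


def build_request(play_name, after=None, page_size=10):
    """Build a query plus composite aggregation body, optionally resuming
    from a previous page's key via "after"."""
    composite = {
        "size": page_size,
        "sources": [{"product": {"terms": {"field": "speaker"}}}],
    }
    if after is not None:
        composite["after"] = after
    return {
        "query": {"term": {"play_name": play_name}},
        "size": 0,
        "aggs": {"speaker": {"composite": composite}},
    }


first = build_request("A Winters Tale")
next_page = build_request("A Winters Tale", after={"product": "FLORIZEL"})
print(json.dumps(next_page, indent=2))
```

The first request omits after entirely; each later request passes the key reported by the previous response.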

elasticsearch terms aggregation output keys

Suppose I have a document like
doc :{
item: {name: "Movie1", code: "M1"}
}
I can simply use a terms aggregation on item.code and get all the buckets. But is it possible to aggregate on item.code and have the output bucket key be the value of item.name?
PS: I know I could use item.name in the terms aggregation, but due to the nature of the data (the stored names vary slightly, hence I have to use the code), I need to bucket by code but output the name as the key.
Not exactly what you are looking for but it does what you need:
{
  "size": 0,
  "aggs": {
    "whatever": {
      "terms": {
        "field": "item.code",
        "size": 10
      },
      "aggs": {
        "top1": {
          "top_hits": {
            "size": 1,
            "_source": {"exclude": "*"},
            "fields": ["item.name"]
          }
        }
      }
    }
  }
}
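The name then has to be read out of the top hit of each bucket on the client. A rough Python sketch, assuming each hit carries the requested field as a list under "fields"; the function name and the sample response are hand-written for illustration, and the exact response shape may differ between Elasticsearch versions:

```python
def names_by_code(response, agg_name="whatever", hits_name="top1"):
    """Map each item.code bucket key to the item.name of its top hit."""
    result = {}
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        # Assumption: requested fields come back as lists of values.
        name = hits[0]["fields"]["item.name"][0] if hits else None
        result[bucket["key"]] = {"name": name, "count": bucket["doc_count"]}
    return result


# Hand-written sample response (assumed shape, not real output).
sample_response = {
    "aggregations": {
        "whatever": {
            "buckets": [
                {
                    "key": "M1",
                    "doc_count": 4,
                    "top1": {"hits": {"hits": [{"fields": {"item.name": ["Movie1"]}}]}},
                }
            ]
        }
    }
}

print(names_by_code(sample_response))
```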

Getting cardinality of multiple fields?

How can I get count of all unique combinations of values of 2 fields that are present in documents of my database, i.e. achieve the same functionality as the "cardinality" aggregation provides, but for more than 1 field?
You can use a script to achieve this. Assuming the character '#' is not present in any value of either field (you can use anything else as a separator), the query you're looking for is as follows. Mind you, scripting comes with a performance hit.
{
  "aggs": {
    "multi_field_cardinality": {
      "cardinality": {
        "script": "doc['<field1>'].value + '#' + doc['<field2>'].value"
      }
    }
  }
}
Read more about it here.
A better solution is to use nested terms aggregations and then count the resulting buckets.
{
  "aggs": {
    "Group1": {
      "terms": {
        "field": "Field1",
        "size": 0
      },
      "aggs": {
        "Group2": {
          "terms": {
            "field": "Field2",
            "size": 0
          }
        }
      }
    }
  }
}
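Counting the resulting buckets is then done on the client: each inner bucket is one unique (Field1, Field2) combination. A minimal Python sketch, assuming the parsed response is a dict; the function name and the sample values are illustrative only:

```python
def count_combinations(response, outer="Group1", inner="Group2"):
    """Count unique (Field1, Field2) pairs by summing the inner buckets."""
    outer_buckets = response["aggregations"][outer]["buckets"]
    return sum(len(bucket[inner]["buckets"]) for bucket in outer_buckets)


# Hand-written sample: "a" pairs with "x" and "y", "b" pairs with "x".
sample_response = {
    "aggregations": {
        "Group1": {
            "buckets": [
                {"key": "a", "doc_count": 3, "Group2": {"buckets": [
                    {"key": "x", "doc_count": 2},
                    {"key": "y", "doc_count": 1},
                ]}},
                {"key": "b", "doc_count": 1, "Group2": {"buckets": [
                    {"key": "x", "doc_count": 1},
                ]}},
            ]
        }
    }
}

print(count_combinations(sample_response))
```

The count is only exact if both terms aggregations return all of their buckets.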

show all buckets from aggregation within a single _type where one index contains multiple _type with same field names

I created an index named "electronics". In that index I created two types, "mobiles" and "laptops", which share a common field named "screensize".
Since I need to show facets for all the terms present in the fields, I am using aggregations to generate the terms and their facets.
{
  "aggs": {
    "distinct_field": {
      "terms": {
        "field": "screensize",
        "min_doc_count": 0,
        "size": 0
      }
    }
  }
}
In the response I am getting all the screensizes from the mobiles _type as well as the laptops _type (since Lucene treats identically named fields from different types as a single field). I only need the terms present in mobiles, even if their count is 0.
I thought about doing a filtered query for mobiles _type before doing aggregations, but the results were still the same.
{
  "query": {
    "filtered": {
      "filter": {
        "type": {
          "value": "mobiles"
        }
      }
    }
  },
  "aggs": {
    "distinct_field": {
      "terms": {
        "field": "screensize",
        "min_doc_count": 0,
        "size": 0
      }
    }
  }
}
Is there any way I could possibly get only the terms from a single _type for a particular field?
I suggest another approach: a terms aggregation with a script instead of a field, like this. The script returns the value of screensize only if the type of the document is mobiles, and null otherwise. This should work, try it out:
{
  "aggs": {
    "distinct_field": {
      "terms": {
        "script": "doc._type.value == 'mobiles' ? doc.screensize.value : null",
        "min_doc_count": 0,
        "size": 0
      }
    }
  }
}
For this to work, you also need to make sure that scripting is enabled.
