Counting records in Elasticsearch while avoiding duplicates

For the results of a /_search request, I would like to get the total record count, with the condition that multiple records sharing the same value in fieldxyz are counted as a single record. For example, here are the full results:
Doc 1 {field_one: 'value one', fieldxyz: 'value four'}
Doc 2 {field_one: 'value two', fieldxyz: 'value five'}
Doc 3 {field_one: 'value three', fieldxyz: 'value four'}
Because 'value four' occurs twice, I would like to count those two records as one, and the final count should be 2.
How can I do that?

You can use the Elasticsearch cardinality aggregation, which returns an approximate count of the distinct values of a field:
{
  "aggs": {
    "counting": {
      "cardinality": {
        "field": "fieldxyz"
      }
    }
  }
}
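As a minimal sketch, the same request body can be built in Python and checked against the three example documents. The helper names and the "size": 0 setting are assumptions added for illustration, not part of the original answer. In a real response the count would be read from aggregations.counting.value.

```python
import json


def distinct_count_body(field, agg_name="counting"):
    """Build a search body with a cardinality aggregation on `field`."""
    return {
        "size": 0,  # assumption: skip the hits, only the aggregation is needed
        "aggs": {agg_name: {"cardinality": {"field": field}}},
    }


def distinct_count_local(docs, field):
    """Exact client-side count of distinct values, used only as a sanity check."""
    return len({doc[field] for doc in docs if field in doc})


docs = [
    {"field_one": "value one", "fieldxyz": "value four"},
    {"field_one": "value two", "fieldxyz": "value five"},
    {"field_one": "value three", "fieldxyz": "value four"},
]

body = distinct_count_body("fieldxyz")
print(json.dumps(body, indent=2))
print(distinct_count_local(docs, "fieldxyz"))  # 2
```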

Related

Elasticsearch "size" value not working in terms aggregation with partitions

I am trying to paginate over a specific field using the terms aggregation with partitions.
The problem is that the number of returned terms for each partition is not equal to the size parameter that I set.
These are the steps that I am doing:
Retrieve the number of different unique values for the field with "cardinality" aggregation.
In my data, the result is 21.
On the web page, the user wants to display a table with 10 items per page, so I compute the number of partitions:
if unique_values % page_size != 0:
    partitions_number = (unique_values // page_size) + 1
else:
    partitions_number = unique_values // page_size
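The branch above is a ceiling division. A compact equivalent, as a sketch (the function name is hypothetical):

```python
def partitions_needed(unique_values: int, page_size: int) -> int:
    """Ceiling division: smallest number of pages that can hold every value."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return -(-unique_values // page_size)


print(partitions_needed(21, 10))  # 3
print(partitions_needed(20, 10))  # 2
```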
Then I run this simple query:
POST my_index/_search?pretty
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "field_to_paginate": "foo"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_pchostname": {
      "terms": {
        "size": 10,
        "field": "field_to_paginate",
        "include": {
          "partition": 0,
          "num_partitions": 3
        }
      }
    }
  }
}
I expect to retrieve 10 results, but when I run the query I get only 7.
What am I missing here? Do I need to use a different solution here?
As a side note, I can't use composite aggregation because I need to sort results by doc_count over the whole dataset.
Partitions in a terms aggregation divide the unique values into roughly equal chunks.
In your case num_partitions is 3, so each partition holds about 21 / 3 = 7 values, which is fewer than the size of 10.
Partitions are meant for working through very large numbers of unique values, on the order of thousands.
You may be able to leverage the shard_size parameter. My suggestion is to read the relevant part of the manual and experiment with shard_size.
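To make the arithmetic concrete, here is a small sketch that builds one request body per partition and reports the average number of terms to expect in each. The helper names are hypothetical, and the average is only an estimate because the values are not guaranteed to be split perfectly evenly.

```python
def terms_partition_body(field, partition, num_partitions, size):
    """Build a terms aggregation body restricted to a single partition."""
    return {
        "size": 0,
        "aggs": {
            "by_pchostname": {
                "terms": {
                    "field": field,
                    "size": size,
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                }
            }
        },
    }


def average_terms_per_partition(cardinality, num_partitions):
    """Average number of unique values that lands in each partition."""
    return cardinality / num_partitions


bodies = [terms_partition_body("field_to_paginate", p, 3, 10) for p in range(3)]
print(len(bodies))                         # 3
print(average_terms_per_partition(21, 3))  # 7.0
```

With 21 unique values and 3 partitions the average is below the requested size of 10, so size never becomes the limiting factor.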
Terms aggregation does not allow pagination. Use composite aggregation instead (requires ES >= 6.1.0). Below is the quote from reference docs:
If you want to retrieve all terms or all combinations of terms in a
nested terms aggregation you should use the Composite aggregation
which allows to paginate over all possible terms rather than setting a
size greater than the cardinality of the field in the terms
aggregation. The terms aggregation is meant to return the top terms
and does not allow pagination.
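A sketch of how composite pagination could look from Python, assuming the request body is sent with whatever client is already in use. The function name, the source name "value", and the example after key are hypothetical. Each response carries an after_key under the aggregation, which is passed back as "after" to fetch the next page. Note that composite buckets are ordered by their source values, not by doc_count, which is the limitation the question mentions.

```python
def composite_page_body(field, page_size, after_key=None, agg_name="pages"):
    """Build a composite aggregation body for one page of terms."""
    composite = {
        "size": page_size,
        "sources": [{"value": {"terms": {"field": field}}}],
    }
    if after_key is not None:
        # Resume after the last bucket of the previous page.
        composite["after"] = after_key
    return {"size": 0, "aggs": {agg_name: {"composite": composite}}}


first_page = composite_page_body("field_to_paginate", 10)
# Hypothetical after_key, shaped like the one a previous response would return.
second_page = composite_page_body(
    "field_to_paginate", 10, after_key={"value": "some-term"}
)
print(second_page["aggs"]["pages"]["composite"]["after"])
```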

Complex ElasticSearch Query

I have documents with (id, value, modified_date). I need to get all the documents for ids which have a specific value as of the last modified_date.
My understanding is that I first need to find such ids and then put them inside a bigger query. To find such ids, it looks like I would use "top_hits" with some post-filtering of the results.
The goal is to do as much work as possible on the server side to speed things up. Would've been trivial in SQL, but with ElasticSearch I am at a loss. And then I would need to write this in python using elasticsearch_dsl. Can anyone help?
UPDATE: In case it's not clear, "all the documents for ids which have a specific value as of the last modified_date" means: 1. group by id, 2. in each group select the record with the largest modified_date, 3. keep only those records that have the specific value, 4. from those records keep only ids, 5. get all documents where ids are in the list coming from 4.
Specifically, 1 is an aggregation, 2 is another aggregation using "top_hits" with reverse sorting by date, 3 is an analog of SQL's HAVING clause (Bucket Selector Aggregation?), 4 is _source filtering, and 5 is a terms lookup.
My biggest challenge so far has been figuring out that Bucket Selector Aggregation is what I need and putting things together.
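To pin down the intended semantics, the five steps from the update can be sketched client-side in plain Python. This is only an in-memory illustration with made-up sample data, not an Elasticsearch query, and the function name is hypothetical.

```python
from collections import defaultdict


def docs_for_ids_with_latest_value(docs, wanted_value):
    """Return every document whose id has `wanted_value` in its latest record."""
    # Step 1: group by id.
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["id"]].append(doc)

    # Steps 2-4: take the latest record per group, keep the ids whose
    # latest record carries the wanted value.
    ids = {
        doc_id
        for doc_id, group in groups.items()
        if max(group, key=lambda d: d["modified_date"])["value"] == wanted_value
    }

    # Step 5: fetch all documents for those ids.
    return [doc for doc in docs if doc["id"] in ids]


# Made-up sample data; ISO date strings compare correctly as plain strings.
docs = [
    {"id": "a", "value": "x", "modified_date": "2020-01-01"},
    {"id": "a", "value": "y", "modified_date": "2020-02-01"},
    {"id": "b", "value": "y", "modified_date": "2020-01-15"},
    {"id": "b", "value": "x", "modified_date": "2020-03-01"},
]

result = docs_for_ids_with_latest_value(docs, "x")
print([(d["id"], d["modified_date"]) for d in result])
# [('b', '2020-01-15'), ('b', '2020-03-01')]
```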
This shows an example of how to get the latest elements in each group:
How to get latest values for each group with an Elasticsearch query?
The following returns the average price bucketed into one-day intervals:
GET /logstash-*/_search?size=0
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1d",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      },
      "aggs": {
        "1": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
I wrote it so that it matches all records, which obviously returns more data than you need. Depending on the amount of data, it might be easier to finish the task on the client side.

Searching for multiple values in a String array in Elastic

I have a field that I am indexing into Elasticsearch that is an array of strings. For example, here is what the string array looks like in two records:
Record 1: {"str1", "str2", "str3", "str4", "str5"}
Record 2: {"str1", "str2", "str6", "str7", "str8"}
Question 1: I want to be able to query for multiple strings in this array. For example, my query has "str1", "str2", "str3" as the search parameters, and I want to find records where the string array contains any of these three strings.
Question 2: In the scenario above, will Record 1 be returned with a higher score than Record 2 (since all three strings are in the array for Record 1 but only two are in Record 2)?
Is this possible at all? Can you please help with what the query should look like, and tell me whether the scoring works the way I stated?
You can index them as an array, such as:
{
  "myArrayField": [ "str1", "str2", "str3", "str4", "str5" ],
  ...
}
You would then be able to query a number of ways, the simplest for your case being a match query (which is analyzed):
{
  "match" : {
    "myArrayField" : "str1 str2 str3"
  }
}
Or a terms query (which is not analyzed):
{
  "terms" : {
    "myArrayField" : [ "str1", "str2", "str3" ]
  }
}
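And yes, with the match query, documents that match more of the query terms receive a higher score, so Record 1 would typically score higher than Record 2. The terms query, by contrast, gives every matching document the same constant score.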
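As a rough illustration of why Record 1 ranks first, the sketch below counts how many query strings each record contains. This is not Elasticsearch's actual relevance scoring, only the "more matching terms" intuition behind it, and the function name is hypothetical.

```python
def matched_terms(array_field, query_terms):
    """Return the query terms that are present in the array field, sorted."""
    return sorted(set(array_field) & set(query_terms))


record_1 = ["str1", "str2", "str3", "str4", "str5"]
record_2 = ["str1", "str2", "str6", "str7", "str8"]
query = ["str1", "str2", "str3"]

print(matched_terms(record_1, query))  # ['str1', 'str2', 'str3']
print(matched_terms(record_2, query))  # ['str1', 'str2']
```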

Aggregate Terms Usage Count

I'm trying to work out a way of finding the most popular terms and their usage in ElasticSearch. The Terms Aggregation is very close but returns the count of documents that the term appeared in, rather than how many times the term appeared.
For example, imagine an appropriate index has been created to index these example documents:
{ text: 'one two two' }
{ text: 'two three' }
Then executing the following search:
{
  aggregations: {
    popular_terms: {
      terms: {
        field: 'text'
      }
    }
  }
}
Will return:
... {
  buckets: [
    { key: 'two', value: 2 },
    { key: 'one', value: 1 },
    { key: 'three', value: 1 }
  ]
}
Is it possible to search with an aggregation counting instances of the terms in a similar way? So in this example returning 3 for the value 'two' as it appears twice in the first document?
An aggregation counts the number of documents matching a criterion (e.g. terms), so it won't return what you are expecting.
For your use case you can probably use term vectors, which expose per-document term frequencies.
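The difference between the two counts can be shown with a small client-side sketch on the example documents. Whitespace splitting stands in for the analyzer here, which is a simplifying assumption, and the function names are hypothetical.

```python
from collections import Counter


def document_counts(texts):
    """Count, per term, how many documents contain it (terms aggregation view)."""
    counts = Counter()
    for text in texts:
        counts.update(set(text.split()))  # each term counted once per document
    return counts


def occurrence_counts(texts):
    """Count, per term, how many times it occurs across all documents."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


texts = ["one two two", "two three"]

print(document_counts(texts)["two"])    # 2
print(occurrence_counts(texts)["two"])  # 3
```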

Sorting by value in multivalued field in elasticsearch

I have a multivalued integer field in my documents, for example:
{
  values: [1,2,3,4,5]
}
I apply a range filter, for example from 2 to 4, and get the list of documents whose values contain 2, 3, or 4.
Now I'd like to sort the results so that documents containing 3 are returned first.
I could do it using script sorting:
{
  sort: {
    _script: {
      script: "doc['values'].getValues().contains(3) ? 0 : 1",
      type: "number"
    }
  }
}
But I don't like its performance, because getValues() actually returns a List, and its contains method is O(n).
Are there any better ways?
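One possible script-free approach, offered only as a sketch and not taken from the original thread: keep the range condition as a filter and add an optional should clause for the preferred value, so that documents containing it get a higher relevance score and come first under the default sort by score. The function name is hypothetical, and the behavior should be verified against the Elasticsearch version in use.

```python
def prefer_value_body(field, low, high, preferred):
    """Filter by range, then let an optional term clause lift matching documents."""
    return {
        "query": {
            "bool": {
                # The filter decides which documents match; it does not affect the score.
                "filter": [{"range": {field: {"gte": low, "lte": high}}}],
                # Optional clause: only documents containing `preferred` earn a score from it.
                "should": [{"term": {field: preferred}}],
                "minimum_should_match": 0,
            }
        }
        # No explicit sort: results are ordered by score, highest first, by default.
    }


body = prefer_value_body("values", 2, 4, 3)
print(body["query"]["bool"]["should"])  # [{'term': {'values': 3}}]
```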
