I have a document with a nested collection, and the goal is to find the documents that don't have any inner items intersecting a particular period, also taking each item's status into account.
inb4: I've found at least two SO questions that didn't help me (probably because I'm a noob):
ElasticSearch - find all documents whose nested documents do not intersect with date range
and
Elasticsearch inverse range overlap query
So, my document is, let's say, simple (I'll paste a JSON definition, just for simplicity; all mappings are fine, trust me ;) ):
{
"maintenances": [ // <-- this is nested collection
{
"start": "date/time",
"end": "date/time",
"status": boolean
} ]
}
What I need is a query for documents that don't have any active (status = true) maintenance items intersecting some period (from and to, for instance).
I started from a simple expression:
must_not { nested { exists { field: maintenances } } }
or
must_not { nested { must [
{ maintenances.start <= to },
{ maintenances.end >= from },
{ maintenances.status = true }
] } }
That returned every document from my test entries.
Remembering that a nested query returns the outer document if any nested document matches the expression, I decided to make the query more complex, something like:
maintenances not exists
or
(
any maintenance within range is not active
and
any maintenance outside range is active
)
But it became clear (not very quickly, unfortunately) that this query doesn't work for all sorts of edge cases, for example when all nested maintenance items are inactive, or when all maintenance items are outside the requested bounds.
Currently I'm not sure, but I assume the query should contain as many OR-ed clauses as there are edge cases? Like:
maintenances not exists
or
(
any maintenance within range is not active
and
any maintenance outside range
)
or
(
any maintenance within range is not active
and
no maintenance outside range
)
or
(
no maintenance within range
and
any maintenance outside range
)
or
OVER9000 of them
Does anyone know the simplest way to query Elasticsearch for my case?
"all mappings are fine, trust me"
Suddenly (not really) this turned out to be the issue. The missing mapping for status prevented the data from being filtered correctly.
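For reference, the must_not + nested idea from the question can be sketched as a small Python helper that builds the request body. This is only a sketch, assuming maintenances is mapped as nested, start and end as dates, and status as boolean; the function name and the sample dates are made up for illustration.

```python
def no_active_maintenance_query(period_from, period_to):
    """Build a query for documents with NO active maintenance
    overlapping the period [period_from, period_to]."""
    # A maintenance overlaps the period when it starts before the period
    # ends and ends after the period starts; it must also be active.
    overlapping_active = {
        "nested": {
            "path": "maintenances",  # assumes a nested mapping
            "query": {
                "bool": {
                    "filter": [
                        {"range": {"maintenances.start": {"lte": period_to}}},
                        {"range": {"maintenances.end": {"gte": period_from}}},
                        {"term": {"maintenances.status": True}},
                    ]
                }
            },
        }
    }
    # must_not around the nested query keeps documents where no nested
    # item matches all three conditions at once, including documents
    # that have no maintenances at all.
    return {"query": {"bool": {"must_not": [overlapping_active]}}}


body = no_active_maintenance_query("2020-01-01", "2020-01-31")
```

Because all three conditions sit inside one nested clause, they are checked against the same maintenance item, so the long list of OR-ed edge cases should not be needed.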
Hope everyone is staying safe!
I am trying to explore the proper way to tackle the following use case in Elasticsearch.
Let's say that I have about 700000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary id can be the same for more than one doc (usually up to 2-3 docs will have the same primary_id). In all other cases the primary_id is not repeated in any other doc.
So on average, out of every 10 docs I will have 8 unique primary ids and 1 primary id shared by 2 docs.
To ensure uniqueness I tried using the terms aggregation and I ended up getting buckets in response to my search request but not for the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternative solutions, and tried the solution in this link as well: https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests using multiple search requests, each specifying the partition number to fetch (depending on how many partitions you divide your result into). But I receive client timeouts even with high client-side timeout settings.
Ideally, I want to know the best way to handle data where the cardinality of the field that forms the buckets (its number of distinct values) is almost equal to the number of docs. The SQL equivalent would be select DISTINCT ( primary_id) from .....
But in elasticsearch, distinct things can only be processed via bucketing (terms aggregation).
I also use top hits as a sub aggregation query under terms aggregation to fetch the _source fields.
Any help would be extremely appreciated!
Thanks!
There are 3 ways to paginate aggregations:
Composite aggregation
Partition
Bucket sort
Partition you have already tried.
Composite aggregation: it can combine multiple sources into a single set of buckets and allows pagination and sorting over them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You fetch "n" records, then pass the returned after_key in the next request to fetch the next "n" records.
GET index22/_search
{
"size": 0,
"aggs": {
"ValueCount": {
"value_count": {
"field": "id.keyword"
}
},
"pagination": {
"composite": {
"size": 2,
"sources": [
{
"TradeRef": {
"terms": {
"field": "id.keyword"
}
}
}
]
}
}
}
}
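The after_key loop described above could look roughly like this in Python. It is a sketch only: search is a hypothetical callable that takes a request body and returns the parsed response as a dict (a stand-in for whatever client you use), and the aggregation name "pagination" and source name "primary" are arbitrary.

```python
def iter_composite_buckets(search, field, page_size=1000):
    """Yield every bucket of a composite aggregation, page by page.

    `search` is a hypothetical callable: request body (dict) in,
    parsed Elasticsearch response (dict) out.
    """
    after_key = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"primary": {"terms": {"field": field}}}],
        }
        if after_key is not None:
            # Resume right after the last bucket of the previous page.
            composite["after"] = after_key
        body = {"size": 0, "aggs": {"pagination": {"composite": composite}}}
        result = search(body)["aggregations"]["pagination"]
        buckets = result["buckets"]
        if not buckets:
            return  # an empty page means everything has been read
        yield from buckets
        after_key = result.get("after_key")
        if after_key is None:
            return
```

Each request only asks for page_size buckets, so individual requests stay small even when almost every document has its own primary_id.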
Bucket sort
The bucket_sort aggregation, like all pipeline aggregations, is
executed after all other non-pipeline aggregations. This means the
sorting only applies to whatever buckets are already returned from the
parent aggregation. For example, if the parent aggregation is terms
and its size is set to 10, the bucket_sort will only sort over those
10 returned term buckets
So this isn't suitable for your case.
You can increase the result size to a value greater than 10K by updating the index.max_result_window setting. Setting too big a size can cause out-of-memory issues, so you need to test how much your hardware can support.
A better option is to use the scroll API and perform the distinct on the client side.
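The client-side distinct could be as simple as the sketch below. It assumes each hit is a dict with a _source containing primary_id (the field name from the question), and that hits is any iterable of hits, for example the pages you collect from the scroll API.

```python
def distinct_by_primary_id(hits, key="primary_id"):
    """Keep the first hit seen for each value of `key`."""
    first_seen = {}
    for hit in hits:
        value = hit["_source"][key]
        # setdefault stores the hit only if the value is new.
        first_seen.setdefault(value, hit)
    # Dicts preserve insertion order, so the result follows scroll order.
    return list(first_seen.values())


sample_hits = [
    {"_id": "1", "_source": {"primary_id": "A"}},
    {"_id": "2", "_source": {"primary_id": "B"}},
    {"_id": "3", "_source": {"primary_id": "A"}},
]
unique_hits = distinct_by_primary_id(sample_hits)
```

Memory use grows with the number of distinct ids, which for roughly 700000 small docs is usually manageable on the client.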
The best way to explain what I want to accomplish is by example.
Let us say that I have an object with fields name, color and transaction_id. I want to search for documents where name and color match the specified values, and that I can accomplish easily with boolean queries.
But I do not want only the documents found by the search query. I also want the transaction those documents belong to, which is specified with transaction_id. For example, if a document has been found with transaction_id equal to 123, I want my query to return all documents with transaction_id equal to 123.
Of course, I can do that with two queries: the first one fetches all documents that match the criteria, and the second one returns all documents that have one of the transaction_id values found by the first query.
But is there any way to do it in a single query?
You can use a parent-child relationship between the transaction and your object, or denormalize your data and nest the objects inside the transactions. Otherwise you'll have to do an application-side join, meaning 2 queries.
Try an index mapping similar to the following, and include a parent_id in the objects.
{
"mappings": {
"transaction": {},
"object": {
"_parent": {
"type": "transaction"
}
}
}
}
Further reading:
https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child-mapping.html
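If you stay with the application-side join, the glue between the two queries is small. A sketch in Python, assuming the first query's hits are dicts with transaction_id in _source; the helper names are made up for illustration.

```python
def transaction_ids(hits):
    """Collect the distinct transaction_id values of the first query's hits."""
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys(hit["_source"]["transaction_id"] for hit in hits))


def second_query(ids):
    """Build the follow-up query: every document of those transactions."""
    return {"query": {"terms": {"transaction_id": ids}}}


first_hits = [
    {"_source": {"name": "a", "color": "red", "transaction_id": 123}},
    {"_source": {"name": "a", "color": "red", "transaction_id": 456}},
    {"_source": {"name": "a", "color": "red", "transaction_id": 123}},
]
follow_up = second_query(transaction_ids(first_hits))
```

Keep in mind that a terms query has a limit on the number of terms it accepts, so a very large first result may need to be split into batches.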
I'm looking to add a feature to an existing query. Basically, I run a query that returns, say, 1000 documents. Those documents all have the same structure; only the values of certain fields vary. What I'd like is to not only get the full list as a result, but also count how many results have a field X with the value Y, how many results have the same field X with the value Z, etc.
Basically get all the results + 4 or 5 "counts" that would act like the SQL "group by", in a way.
The point of this is to allow full text search over all the clients in our database (without filtering), while showing how many of those are active clients, past clients, active prospects etc...
Is there any way to do this without running additional/separate queries?
EDIT WITH ANSWER :
Aggregations are the way to go. Here's how I did it; it's so straightforward that I expected much harder work!
{
"query": {
"term": {
"_type":"client"
}
},
"aggregations" : {
"agg1" : {
"terms" : {
"field" : "listType.typeRef.keyword"
}
}
}
}
Note that the aggregated field is even inside a list of terms and not a single field; that's just how easy it was!
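On the client, the counts can be read from the buckets of the terms aggregation. A minimal Python sketch, assuming the response has already been parsed into a dict and the aggregation is named agg1 as in the query above; the keys and numbers in the sample response are invented.

```python
def counts_from_terms_agg(response, agg_name="agg1"):
    """Turn the buckets of a terms aggregation into {value: count}."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Invented sample shaped like a terms aggregation response.
sample_response = {
    "hits": {"hits": []},
    "aggregations": {
        "agg1": {
            "buckets": [
                {"key": "active_client", "doc_count": 620},
                {"key": "past_client", "doc_count": 380},
            ]
        }
    },
}
counts = counts_from_terms_agg(sample_response)
```

The hits and the aggregation arrive in the same response, so no extra query is needed for the counts.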
I believe what you are looking for is the aggregation query.
The documentation should be clear enough, but if you struggle please give us your ES query and we will help you from there.
I am making use of the functionality that was added in Elasticsearch 1.5 to allow a top hits aggregation inside a nested aggregation. The problem I have is that once I have my top nested documents I want to be able to also get fields from their outer documents.
My pseudo aggregation structure is:
nested: {
some_other_aggregation: {
"top_hits": {
}
}
}
The top nested hits include the index, type and id of the outer document, so I could perform a secondary search, but I'd like to avoid that. My other option is to return all of the hits from the query (currently I only return the results of the aggregations) and then match up the documents with the events in my code, but that seems bad from a performance point of view.
Can anyone suggest something better? Thanks.
I have a question about highlighting nested object fields.
Consider a record like this:
_source: {
id: 286
translations: [
{
id: 568
language: lang1
value: foo1 bar1
}
{
id: 569
language: lang2
value: foo2 bar2
}
]
}
If translations.value has an ngram filter, is it possible to highlight matches in a nested object such as this one?
And what would the highlight query look like?
Thanks a lot for any response.
Same problem over here. It seems that there is no way to do it in Elasticsearch, and there won't be in the near future.
Developer Shay Banon wrote:
In order to do highlighting based on the nested query, the nested
documents needs to be extracted as well in order to highlight it,
which is more problematic (and less performant).
Also:
His explanation was that this would take a good amount of memory as
there can be a large number of children. And it looks genuine to me as
adding this feature will violate the basic concept of processing only
N number of feeds at a time.
So the only way is to process the result of a query manually in your own program to add the highlights.
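Manual highlighting of that kind might look like the Python sketch below. It assumes the _source layout from the question (a translations list with a value field) and that you already know which search terms to mark; the function name and the em tag are arbitrary choices, and substring matching is used as a rough stand-in for ngram matches.

```python
import re


def highlight_translations(source, terms, tag="em"):
    """Return each translations[].value with matches of `terms` wrapped in a tag."""
    values = [t["value"] for t in source["translations"]]
    if not terms:
        return values  # nothing to highlight
    # Longest terms first so that overlapping terms prefer the longer match.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)

    def wrap(match):
        return "<{0}>{1}</{0}>".format(tag, match.group(0))

    return [pattern.sub(wrap, value) for value in values]


record = {
    "id": 286,
    "translations": [
        {"id": 568, "language": "lang1", "value": "foo1 bar1"},
        {"id": 569, "language": "lang2", "value": "foo2 bar2"},
    ],
}
highlighted = highlight_translations(record, ["foo1"])
```

If the values are rendered as HTML, escape them before wrapping the matches.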
Update
I don't know about tire or ngram filters, but I found a way to retrieve all nested documents matching a filter by using nested facets and facet filters. You need a separate query for highlighting, but it's much faster than browsing through _source, in my case at least.
{"query":
{"match_all":{}},
"facets":{
"matching_translations":{
"nested":"translations",
"terms":{"field":"translations.value"},
"facet_filter":{
"bool":{"must":[{"terms":{"translations.value":["foo1"]}}]}
}
}
}
}
You can use the resulting facet terms for highlighting in your program.
For example, I want to highlight links to nested documents (in jQuery):
setHighlights = function(sdata){
  var highlightDocs = [];
  var facet = sdata['facets'] && sdata['facets']['docIDs'];
  // Collect the terms returned by the facet, if there are any.
  if(facet && facet['terms'] && facet['terms'].length > 0){
    for(var i = 0; i < facet['terms'].length; i++){
      highlightDocs.push(facet['terms'][i]['term']);
    }
  }
  // Mark every link whose id is among the collected terms.
  $('li.document_link').each(function(){
    if($.inArray($(this).attr('id'), highlightDocs) != -1) {
      $(this).addClass('document_selected');
    }
  });
};
I hope that helps a little.
You can use "force_source": true in the highlight fields to cause the document to be highlighted once the nested fields are joined.