ElasticSearch aggregation: exclude one filter per aggregation

I want to filter out documents whose field 'A' is equal to 'a', and I want to facet the field 'A' at the same time, excluding of course the previous filter.
I know that you can put the filter 'outside' the query in order to get the facets without that filter applied, like:
ElasticSearch
{
"query" : { "match_all" : { } },
"filter" : { "term" : { "A" : "a" } },
"facets" : {
"A" : { "terms" : { "field" : "A" } } //this should exclude the filter A:a
}
}
SOLR
&q=*:*
&fq={!tag=Aa}A:a
&facet=true&facet.field={!ex=Aa}A
This is very nice, but what happens if I have multiple filters and facets, where each facet should exclude only its own filter?
Example:
filter=A:a
filter=B:b
filter=C:c
facet={exclude filter A:a}A
facet={exclude filter B:b}B
facet={exclude filter C:c}C
That is, for facet A I want to keep all filters except A:a, for facet B all except B:b, and so on.
The most obvious way would be to do n queries (one per each of the n facets), but I'd like to stay away from that.

The global scope provides access to every document; you can then add the same filters you used for the main query.
I gave an example with global scope in this related topic.
Could you give any feedback about performance issues with post_filter?
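One way to express "each facet keeps every filter except its own" in a single request is to put all filters in post_filter (the successor of the top-level filter shown in the question) and wrap each terms aggregation in a filter aggregation that holds only the other filters. The Python sketch below just builds such a request body. The function name build_request, the inner aggregation name values, and the use of the aggregations API instead of the older facets API are assumptions made for illustration.

```python
def build_request(filters):
    """Build a search body where each aggregation ignores its own filter.

    `filters` maps a field name to the selected value, e.g. {"A": "a"}.
    (Hypothetical helper, not part of any Elasticsearch client.)
    """
    def term(field, value):
        return {"term": {field: value}}

    aggs = {}
    for field in filters:
        # Every selected filter except the one on the field being aggregated.
        others = [term(f, v) for f, v in filters.items() if f != field]
        aggs[field] = {
            "filter": {"bool": {"filter": others}},
            "aggs": {"values": {"terms": {"field": field}}},
        }

    return {
        "query": {"match_all": {}},
        # post_filter narrows the returned hits but not the aggregations.
        "post_filter": {
            "bool": {"filter": [term(f, v) for f, v in filters.items()]}
        },
        "aggs": aggs,
    }


body = build_request({"A": "a", "B": "b", "C": "c"})
```

Here the aggregation on A is computed over documents matching B:b and C:c only, while the hits themselves still have to match all three filters.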

Related

Elasticsearch - boosting fields for multi match without specifying complete field list in query

I am trying to boost fields using a multi match query without specifying the complete field list, but I cannot find out how to do it. I am searching through multiple indices on all fields, which I don't know at run time, but I do know which ones are important.
For example, I have index A with the fields 1,2,3,4 and index B with the fields 1,5,6,7,8. I need to search across both indices through all fields, with a boost on field 1.
So far I have:
GET A,B/_search
{
"query": {
"multi_match" : {
"query" : "somethingToSearch"
}
}
}
This goes through all fields on both indices, but I would like to have something like this (ranking a match on field 1 above the others):
GET A,B/_search
{
"query": {
"multi_match" : {
"query" : "somethingToSearch",
"fields" : ["1^5,*"]
}
}
}
Is there any way to do it without using bool queries?

Queries vs Filters - Order of execution

I've read this question, and a colleague of mine made me doubt:
In a filtered query, when is the filter applied? Before or after executing the query? When is the result cached?
If the filter is applied beforehand, wouldn't it be a good thing to duplicate the query part in the filters?
If the filter is applied afterward, then I'm having trouble understanding what is cached.
Luckily, ES provides two types of filters for you to work with:
{
"query" : {
"field" : { "title" : "Catch-22" }
},
"filter" : {
"term" : { "year" : 1961 }
}
}
{
"query": {
"filtered" : {
"query" : {
"field" : { "title" : "Catch-22" }
},
"filter" : {
"term" : { "year" : 1961 }
}
}
}
}
In the first case, filters are applied to all documents found by the query. In the second case, the documents are filtered before the query runs. This yields better performance.
Quoted from: http://www.packtpub.com/elasticsearch-server-for-fast-scalable-flexible-search-solution/book
As for caching, I'm not sure how the filter cache works.
My guess would be:
In the first case, since the filter is applied to the set of results returned by the query, the cache is specific to that result set.
In the second case, the filter is applied first and the cache is stored for the indices you searched. This cache is more reusable because it does not depend on the content of the query, but it costs more memory and makes the first query slower (before the cache is built).
Let me explain how a search query is executed.
First, there is always a complete set of documents that you search against.
If the search includes a filter, the filter simply makes that set smaller; in other words, filter queries are cached results of the same query.
You then have a smaller tree to search with your query text.
As for your doubt: duplicating the query in the filters will only add overhead to the cache mechanism. There are many guidelines on what to put in a filter and what to leave out; it all comes down to relevancy.
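For readers on newer versions: the filtered query was deprecated in Elasticsearch 2.0 and removed in 5.0, and a bool query with a filter clause took its place. Clauses under filter do not contribute to the score, which is what makes them candidates for caching. A minimal Python sketch of the second example in that form follows; the helper name filtered_search is hypothetical, and match stands in for the old field query.

```python
def filtered_search(title, year):
    """Build a bool query: scored full-text part plus an unscored filter.

    (Hypothetical helper; only builds the request body.)
    """
    return {
        "query": {
            "bool": {
                # Scored: decides how relevant each matching document is.
                "must": [{"match": {"title": title}}],
                # Not scored: only decides which documents are eligible.
                "filter": [{"term": {"year": year}}],
            }
        }
    }


request = filtered_search("Catch-22", 1961)
```

Putting the year condition under filter rather than must keeps it out of relevance scoring, which matches the advice above not to duplicate the query part inside the filters.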

Elasticsearch: include specific facet values

Elasticsearch provides a parameter to exclude certain terms from the facet values, like this:
"facets" : {
"tag" : {
"terms" : {
"field" : "tag",
"exclude" : ["term1", "term2"]
}
}
}
Is there any way to include only certain terms?
I'm trying to get counts for the terms the user has already selected, along with the global facets. E.g. you selected the word science with count 20 (from autocomplete); I recompute the facets to show other words that might be selected, but the word science would not make it into the facet results, since other words from the global facets have counts above 400.
Is there any particular solution for this task?
Thanks for your help.
You can use scripting for that. The script will be run for each facet entry with the input variable term that contains the current value. The entry will be included or not on the final facet depending on the result of the script. If it returns false it will be excluded, otherwise it will be included.
"facets" : {
"tag" : {
"terms" : {
"field" : "tag",
"script" : "term == 'aaa' ? true : false"
}
}
}
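In the aggregations API that later replaced facets, the terms aggregation also accepts an include parameter (a regular expression or a list of exact values) next to exclude, so no script is needed for this case. A small Python sketch, assuming the field is called tag as in the question; the helper name selected_term_counts and the aggregation name selected are made up for illustration.

```python
def selected_term_counts(field, selected):
    """Build a request that counts only the already selected terms.

    (Hypothetical helper; only builds the request body.)
    """
    return {
        "size": 0,  # no hits needed, only the counts
        "aggs": {
            "selected": {
                "terms": {
                    "field": field,
                    # Exact values to keep; everything else is dropped.
                    "include": list(selected),
                }
            }
        },
    }


request = selected_term_counts("tag", ["science"])
```

This can sit next to a regular terms aggregation on the same field in one request, so the selected terms are reported even when they fall outside the top counts.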

Can you refer to and filter on a script field in a query expression, after defining it?

I'm new to ElasticSearch and was wondering: once you define a script field with mvel syntax, can you subsequently filter on it, or refer to it in the query body as if it were any other field?
I can't find any examples of this, and at the same time I don't see any mention of whether it is possible on the docs pages:
http://www.elasticsearch.org/guide/reference/modules/scripting/
http://www.elasticsearch.org/guide/reference/api/search/script-fields/
The book ElasticSearch Server doesn't mention whether this is possible either.
As of 2018 and Elasticsearch 6.2, it is still not possible to filter by fields defined with script_fields. However, you can define a custom script filter for the same purpose. For example, let's assume that you've defined the following script field:
{
"script_fields" : {
"some_date_fld_year" : {
"script" : {
"lang" : "painless",
"source" : "doc['some_date_fld'].empty ? null : doc['some_date_fld'].date.year"
}
}
}
}
You can filter by it with:
{
"query": {
"bool" : {
"must" : {
"script" : {
"script" : {
"source": "!doc['some_date_fld'].empty && doc['some_date_fld'].date.year >= 2017",
"lang": "painless"
}
}
}
}
}
}
It's not possible for one simple reason: the script_fields are calculated during the final stage of the search (the fetch phase) and only for the records that you retrieve (top 10 by default). The script filter is applied to all records that were not filtered out by preceding filters, and this happens during the query phase, which precedes the fetch phase. In other words, when the filters are applied, the script_fields don't exist yet.
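Since the value has to be computed twice (once in the query phase to filter, once in the fetch phase to return it), it can help to build both scripts from one place so they stay in sync. The Python sketch below does only that. The helper name year_request and the params.min_year parameter are assumptions for illustration, and the doc[...].date syntax is the Elasticsearch 6.x form used in the answer.

```python
# Expression from the answer, returned per hit during the fetch phase.
FIELD_EXPR = "doc['some_date_fld'].empty ? null : doc['some_date_fld'].date.year"
# Pieces of the filter script, evaluated per candidate in the query phase.
HAS_DATE = "!doc['some_date_fld'].empty"
YEAR = "doc['some_date_fld'].date.year"


def year_request(min_year):
    """Build a body that filters on the year and also returns it.

    (Hypothetical helper; only builds the request body.)
    """
    return {
        "query": {"bool": {"filter": {"script": {"script": {
            "lang": "painless",
            # Skip documents without a date instead of comparing null.
            "source": f"{HAS_DATE} && {YEAR} >= params.min_year",
            "params": {"min_year": min_year},
        }}}}},
        "script_fields": {"some_date_fld_year": {"script": {
            "lang": "painless",
            "source": FIELD_EXPR,
        }}},
    }


request = year_request(2017)
```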

How to perform a date range elasticsearch query given multiple dates per document?

I'm using ElasticSearch to index forum threads and reply posts. Each post has a date field associated with it. I'd like to perform a query that returns the threads containing posts within a given date range. I've looked at using a nested mapping, but the docs say the feature is experimental and may lead to inaccurate results.
What's the best way to accomplish this? I'm using the Java API.
You haven't said much about your data structure, but I'm inferring from your question that you have post objects which contain a date field, and presumably a thread_id field, ie some way of identifying which thread a post belongs to?
Do you also have a thread object, or is your thread_id sufficient?
Either way, your stated goal is to return a list of threads which have posts in a particular date range. This means that you need to group your threads (rather than returning the same thread_id multiple times for each post in the date range).
This grouping can be done by using facets.
So the query in JSON would look like this:
curl -XGET 'http://127.0.0.1:9200/posts/post/_search?pretty=1&search_type=count' -d '
{
"facets" : {
"thread_id" : {
"terms" : {
"size" : 20,
"field" : "thread_id"
}
}
},
"query" : {
"filtered" : {
"query" : {
"text" : {
"content" : "any keywords to match"
}
},
"filter" : {
"numeric_range" : {
"date" : {
"lt" : "2011-02-01",
"gte" : "2011-01-01"
}
}
}
}
}
}
'
Note:
I'm using search_type=count because I don't actually want the posts returned, just the thread_ids.
I've specified that I want the 20 most frequently encountered thread_ids (size: 20). The default would be 10.
I'm using a numeric_range for the date field because dates typically have many distinct values, and the numeric_range filter uses a different approach from the range filter, making it perform better in this situation.
If your thread_ids look like how-to-perform-a-date-range-elasticsearch-query then you can use these values directly. But if you have a separate thread object, then you can use the multi-get API to retrieve these.
Your thread_id field should be mapped as { "index": "not_analyzed" } so that the whole value is treated as a single term, rather than being analyzed into separate terms.
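The request above uses the APIs of its time (facets, the filtered and text queries, numeric_range, search_type=count), which later Elasticsearch releases replaced. A rough equivalent with a bool query, a range filter and a terms aggregation might look like the Python sketch below. Field names follow the answer; the function name threads_with_posts is hypothetical and the sketch only builds the request body.

```python
def threads_with_posts(keywords, start, end, size=20):
    """Build a request that groups matching posts by thread_id.

    (Hypothetical helper; field names are taken from the answer above.)
    """
    return {
        "size": 0,  # no post hits, only the thread_id buckets
        "query": {
            "bool": {
                "must": [{"match": {"content": keywords}}],
                # Half-open date range: start <= date < end.
                "filter": [{"range": {"date": {"gte": start, "lt": end}}}],
            }
        },
        "aggs": {
            "thread_id": {"terms": {"field": "thread_id", "size": size}}
        },
    }


request = threads_with_posts("any keywords to match", "2011-01-01", "2011-02-01")
```

Each bucket key in the response is a thread_id, which can then be looked up with the multi-get API as described in the notes.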
