Elasticsearch returning distinct results - elasticsearch

Here is my ES query:
{
"fields": [
"news.authorname.raw",
"news.authorid"
],
"query": {
"filtered": {
"filter": {
"terms": {
"news.authorid": [
1,
2
]
}
}
}
}
}
With this query I get a list of {authorid, authorname} pairs. The list contains repeated pairs, and I need the same list without repetitions. This seemed easy enough, or at least that is what I thought this morning. My limited knowledge of ES, together with the lack of documentation, is making me desperate to find a solution to such a trivial problem.
Of course I could fetch the whole list and remove the repetitions in code, but if possible I would prefer not to receive unnecessary data only to discard it afterwards.
Can anyone give me a hand with this? Should I use some other approach?
Thanks in advance!

I would suggest using source filtering:
{
"_source": [ "news.authorname.raw", "news.authorid" ],
"query": {
"filtered": {
"filter": {
"terms": {
"news.authorid": [
1,
2
]
}
}
}
}
}
It is generally easier to handle than fields, whose output can sometimes look like a cartesian product.
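Source filtering only trims the fields returned for each hit, so repeated pairs can still come back. The question mentions removing them in code; a minimal Python sketch of that step follows. It assumes each hit's `_source` is nested as `{"news": {"authorid": ..., "authorname": ...}}`, which is an assumption about the mapping, and the sample response is hand-written rather than real output.

```python
from typing import Any, Dict, List, Tuple


def distinct_authors(response: Dict[str, Any]) -> List[Tuple[Any, Any]]:
    """Return unique (authorid, authorname) pairs in first-seen order."""
    seen = set()
    pairs = []
    for hit in response.get("hits", {}).get("hits", []):
        # Assumed shape: {"_source": {"news": {"authorid": ..., "authorname": ...}}}
        news = hit.get("_source", {}).get("news", {})
        pair = (news.get("authorid"), news.get("authorname"))
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


# Hand-written sample response (hypothetical data) in the assumed shape.
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"news": {"authorid": 1, "authorname": "Ann"}}},
            {"_source": {"news": {"authorid": 2, "authorname": "Bob"}}},
            {"_source": {"news": {"authorid": 1, "authorname": "Ann"}}},
        ]
    }
}

unique_pairs = distinct_authors(sample_response)
print(unique_pairs)
```

This still transfers every hit over the network, so it only addresses the duplication, not the amount of data received.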

Related

How does the flow work in Elasticsearch queries?

I have written a query which has a couple of conditions, as shown below.
GET /agreement/_search
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "T-0668",
"fields": [
"agreecondition.agreementId",
"agreecondition.conditionContractId"
]
}
},
{
"range": {
"agreecondition.validFrom": {
"gte": "02/18/2019"
}
}
},
{
"range": {
"agreecondition.validTo": {
"lte": "03/07/2019"
}
}
}
],
"filter": [
{
"terms": {
"agreecondition.promotionId.keyword": [
"x",
"y"
]
}
}
]
}
}
}
My question is: how does the flow work?
Ex: Does ES first get the results for the must clause's multi_match, then apply the range conditions to that output, and finally apply the filter on top of the output of the range conditions?
I just want clarity on this; if my assumption is wrong, I need to rewrite the query.
You can check the official Elasticsearch blog post on query execution order to understand this in detail, but you might not get all the details you are looking for, because of a limitation that Elastic mentions at the end of the post:
Q: How can I check which query/filter got executed first?
A: We don't really expose this information, which is very internal. However if you
check the output of the profile API, you can count how many times
nextDoc/advance have been called on the one hand, and matches on the
other hand. Query nodes that have the higher counts have been run
first.
Note: the Profile API will be very handy for you, as the blog post also suggests.
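As a rough illustration of the counting approach described in the quote, the Python sketch below turns profiling on for a search body and then walks one profiled query node, collecting the nextDoc/advance call count and the match count per node. The `breakdown` keys (`next_doc_count`, `advance_count`, `match_count`) and the `children` list follow the profile output as I understand it; treat them as assumptions and check them against your version's response. The sample node is hand-written, not real output.

```python
from typing import Any, Dict, List, Optional, Tuple


def with_profiling(body: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the search body with profiling switched on."""
    profiled = dict(body)
    profiled["profile"] = True
    return profiled


def iteration_counts(
    node: Dict[str, Any],
    out: Optional[List[Tuple[str, int, int]]] = None,
) -> List[Tuple[str, int, int]]:
    """Collect (type, nextDoc + advance calls, match calls) for a node and its children."""
    if out is None:
        out = []
    breakdown = node.get("breakdown", {})
    calls = breakdown.get("next_doc_count", 0) + breakdown.get("advance_count", 0)
    out.append((node.get("type", "?"), calls, breakdown.get("match_count", 0)))
    for child in node.get("children", []):
        iteration_counts(child, out)
    return out


# Hand-written sample of one profiled query node (hypothetical numbers).
sample_node = {
    "type": "BooleanQuery",
    "breakdown": {"next_doc_count": 5, "advance_count": 0, "match_count": 0},
    "children": [
        {
            "type": "TermQuery",
            "breakdown": {"next_doc_count": 3, "advance_count": 2, "match_count": 0},
        },
        {
            "type": "PointRangeQuery",
            "breakdown": {"next_doc_count": 0, "advance_count": 4, "match_count": 1},
        },
    ],
}

profiled_body = with_profiling({"query": {"match_all": {}}})
counts = iteration_counts(sample_node)
for query_type, calls, matches in counts:
    print(query_type, calls, matches)
```

Per the quoted answer, nodes with the higher counts were run first; the sketch only surfaces the numbers and does not rank them for you.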

Elasticsearch: How to write an 'OR' clause in filter context?

I'm looking for syntax or an example compatible with ES version 6.7.
I have seen the docs, but I don't see any examples for this and the explanation isn't clear enough to me. I have tried writing a query according to the docs, but I keep getting syntax errors. I have already seen the questions below on SO, but they don't help me:
Filter context for should in bool query (Elasticsearch)
It doesn't have any example.
Multiple OR filter in Elasticsearch
I get a syntax error
"type": "parsing_exception",
"reason": "no [query] registered for [filtered]",
"line": 1,
"col": 31
Maybe it's for a different version of ES.
All I need is a simple example with two 'or'ed conditions (mine is one range and one term but I guess that shouldn't matter much), both I would like to have in filter context (I don't care about scores, nor text search).
If you really need it, I can show my attempts (need to remove some 'sensitive'(duh) parts from it before posting), but they give parsing/syntax errors so I don't think there is any sense in them. I am aware that questions which don't show any efforts are considered bad for SO but I don't see any logic in showing attempts that aren't even parsed successfully, and any example would help me understand the syntax.
You need to wrap your should query in a filter query.
{
"query":{
"bool":{
"filter":[{
"bool":{
"should":[
{ /* Query 1 */ },
{ /* Query 2 */ }
]
}
}]
}
}
}
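To make the skeleton above concrete, here is a small Python sketch that builds such a body with one range and one term condition OR'ed together in filter context. The field names `price` and `status` are hypothetical, and the explicit `minimum_should_match` of 1 is an addition of mine to make the "at least one must match" intent visible rather than rely on defaults.

```python
import json
from typing import Any, Dict


def or_filter(*clauses: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap the given clauses in bool.should inside a bool.filter (no scoring)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {
                        "bool": {
                            "should": list(clauses),
                            # Explicit: at least one of the should clauses must match.
                            "minimum_should_match": 1,
                        }
                    }
                ]
            }
        }
    }


body = or_filter(
    {"range": {"price": {"gte": 10, "lte": 20}}},  # hypothetical field
    {"term": {"status": "active"}},                # hypothetical field
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a `_search` request.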
I had a similar scenario (even the same range and match filters), with one more level of nesting: two conditions to be OR'ed (as in your case) and another condition to be AND'ed with their result. As @Pierre-Nicolas Mougel suggested in another answer, I used nested bool clauses with one more level around the should clause.
{
"_source": [
"my_field"
],
"query": {
"bool": {
"filter": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"range": {
"start": {
"gt": "1558878457851",
"lt": "1557998559147"
}
}
},
{
"range": {
"stop": {
"gt": "1558898457851",
"lt": "1558899559147"
}
}
}
]
}
},
{
"match": {
"my_id": "<My_Id>"
}
}
],
"must_not": []
}
}
}
},
"from": 0,
"size": -1,
"sort": [],
"aggs": {}
}
I read in the docs that minimum_should_match can also be used to force filter context. That might help if this query doesn't work for you.

ElasticSearch Query DSL Combine Terms and Wildcard

I have two distinct queries which work well enough on their own:
{"wildcard":{"city":"*Beach*"}}
{"terms":{"state":["Florida","Georgia"]}}
but trying to combine them into one query is proving to be quite the challenge.
I had thought that simply writing {{"wildcard":{"city":"*Beach*"}},{"terms":{"state":["Florida","Georgia"]}}} would do it, but it does not. I then tried a few different iterations using arrays, bool queries, etc. Can someone point me in the right direction?
Bool query should be the right way to go.
Below is an example for your use case:
{
"query": {
"bool": {
"must": [
{
"wildcard": { "city": "*Beach*" }
},
{
"terms": {
"state": [ "Florida", "Georgia" ]
}
}
]
}
}
}
If there is no result, it means that no entry matches both criteria.

Elastic Search Filter performing much slower than Query

As my ES index/cluster has scaled up (# ~2 billion docs now), I have noticed more significant performance loss. So I started messing around with my queries to see if I could squeeze some perf out of them.
As I did this, I noticed that when I used a Boolean Query in my Filter, my results would take about 3.5-4 seconds to come back. But if I do the same thing in my Query it is more like 10-20ms
Here are the 2 queries:
Using a filter
POST /backup/entity/_search?routing=39cd0b95-efc3-4eee-93d1-93e6f5837d6b
{
"query": {"bool":{"should":[],"must":[{"match_all":{}}]}},
"filter": {
"bool": {
"must": [
{
"term": {
"serviceId": "39cd0b95-efc3-4eee-93d1-93e6f5837d6b"
}
},
{
"term": {
"subscriptionId": "3eb5021e-2f1d-4292-9fd5-95788ebfafa0"
}
},
{
"term": {
"subscriptionType": 0
}
},
{
"terms": {
"entityType": [
"4"
]
}
}
]
}
}
}
Using a query
POST /backup/entity/_search?routing=39cd0b95-efc3-4eee-93d1-93e6f5837d6b
{
"query": {"bool":{"should":[],"must":[
{
"term": {
"serviceId": "39cd0b95-efc3-4eee-93d1-93e6f5837d6b"
}
},
{
"term": {
"subscriptionId": "3eb5021e-2f1d-4292-9fd5-95788ebfafa0"
}
},
{
"term": {
"subscriptionType": 0
}
},
{
"terms": {
"entityType": [
"4"
]
}
}
]}}
}
Like I said, the second method where I don't use a Filter at all takes mere milliseconds, while the first query takes almost 4 seconds. This seems completely backwards from what the documentation says. They say that the Filter should actually be very quick and the Query should be the one that takes longer. So why am I seeing the exact opposite here?
Could it be something with my index mapping? If anyone has any idea why this is happening I would love to hear suggestions.
Thanks
The root filter element is actually another name for the post_filter element. The top-level filter was supposed to be removed in ES 1.1, but it somehow slipped through and still exists in 2.x versions.
It is removed completely in ES 5 though.
So, your first query is not a "filter" query. It's a query whose results are first used in aggregations (if applicable), and then the post_filter/filter is applied to the results. So you basically have a two-step process there: https://www.elastic.co/guide/en/elasticsearch/reference/1.5/search-request-post-filter.html
More about its performance here:
While we have gained cacheability of the tag filter, we have potentially increased the cost of scoring significantly. Post filters are useful when you need aggregations to be unfiltered, but hits to be filtered. You should not be using post_filter (or its deprecated top-level synonym filter) if you do not have facets or aggregations.
A proper filter query is the following:
{
"query": {
"filtered": {
"query": {
"bool": {
"should": [],
"must": [
{
"match_all": {}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"term": {
"serviceId": "39cd0b95-efc3-4eee-93d1-93e6f5837d6b"
}
},
{
"term": {
"subscriptionId": "3eb5021e-2f1d-4292-9fd5-95788ebfafa0"
}
},
{
"term": {
"subscriptionType": 0
}
},
{
"terms": {
"entityType": [
"4"
]
}
}
]
}
}
}
}
}
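Note that the filtered query was deprecated in Elasticsearch 2.0 and removed in 5.0, where a bool query with a filter clause takes its place. As a hedged sketch of that translation, the hypothetical Python helper below rewrites a `filtered` query dict into the bool form; it only handles the plain `query`/`filter` keys used above.

```python
from typing import Any, Dict


def filtered_to_bool(filtered_query: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite {"filtered": {"query": ..., "filter": ...}} as a bool query."""
    inner = filtered_query["filtered"]
    bool_body: Dict[str, Any] = {}
    if "query" in inner:
        # The scoring part goes into must.
        bool_body["must"] = [inner["query"]]
    if "filter" in inner:
        # The non-scoring part goes into filter.
        bool_body["filter"] = [inner["filter"]]
    return {"bool": bool_body}


legacy = {
    "filtered": {
        "query": {"match_all": {}},
        "filter": {"term": {"subscriptionType": 0}},
    }
}

modern = filtered_to_bool(legacy)
print(modern)
```

The result goes under the top-level `"query"` key of the search body, just like the `filtered` query it replaces.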
A filter is faster. Your problem is that you include the match_all query in your filter case. This matches on all 2 billion of your documents. A set operation has to then be done against the filter to cull the set. Omit the query portion in your filter test and you'll see that the results are much faster.

Can _score from different queries be compared?

In my application, I issue multiple queries, each to a different index. Then I merge the results from these queries and sort them by the _score attribute, in order to rank them by relevance. But I wonder whether this makes sense at all, since the results came from different queries.
I guess my question is: can _scores from different queries be compared?
Instead of issuing multiple queries, it would be a good idea to combine them into a single query.
You can use the indices query to run index-specific clauses.
Something like:
{
"bool": {
"should": [
{
"indices": {
"indices": [
"index1"
],
"query": {
"term": {
"tag": "wow"
}
}
}
},
{
"indices": {
"indices": [
"index2"
],
"query": {
"term": {
"name": "laptop"
}
}
}
}
]
}
}
Once this is done, the results are sorted by _score.
Hope that helps.
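One caveat: as far as I know, the indices query was deprecated in Elasticsearch 5.0 and is not available in later major versions, where the usual suggestion is to filter on the `_index` metadata field instead. The Python sketch below builds that variant of the query above; the helper names are hypothetical, and it assumes the search request is sent to both indices.

```python
from typing import Any, Dict


def per_index_clause(index_name: str, query: Dict[str, Any]) -> Dict[str, Any]:
    """Restrict a query to one index by filtering on the _index field."""
    return {
        "bool": {
            "must": [query],
            "filter": [{"term": {"_index": index_name}}],
        }
    }


def combined_query(queries_by_index: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """OR together one index-restricted clause per index."""
    return {
        "query": {
            "bool": {
                "should": [
                    per_index_clause(name, query)
                    for name, query in queries_by_index.items()
                ]
            }
        }
    }


search_body = combined_query(
    {
        "index1": {"term": {"tag": "wow"}},
        "index2": {"term": {"name": "laptop"}},
    }
)
print(search_body)
```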
