Can Elasticsearch search by geo distance and other attributes at the same time? - elasticsearch

In Elasticsearch, when I search by geo-distance to a point, can I at the same time filter by another attribute, such as a number being within a range, so that both filters need to be true for the result to come back?

Sure, use a bool query, where you can specify multiple clauses in the must and/or filter blocks. Be aware that clauses in the must block contribute to the relevance score, while clauses in the filter block do not (read more about query and filter context).
For example, the following query searches by geo-distance (contributing to the score) and at the same time filters on an age range (without contributing to the score):
{
  "query": {
    "bool": {
      "must": [
        {
          "geo_distance": {
            "distance": "100km",
            "pin.location": {
              "lat": 38.889248,
              "lon": -77.050636
            }
          }
        }
      ],
      "filter": [
        {
          "range": {
            "age": {
              "gte": 18,
              "lte": 65
            }
          }
        }
      ]
    }
  }
}

Related

ElasticSearch function score query (range filter)

I want to use document scoring instead of filtering.
As a user I can enter something like buyingPrice (from-to) 50-150€.
This works well with origin, offset and scale, e.g.:
"gauss": {
  "buyingPrice": {
    "origin": 100,
    "offset": 100,
    "scale": 200
  }
}
The problem is when a user only enters one side, e.g. from 50€.
The expected behavior would be that all buyingPrices above 50€ get the full score, and the ones below 50€ get a lower score.
How can I achieve that with Elasticsearch?
You can add a filter inside a function_score function, so that function will only affect the documents matching the filter:
{
  "query": {
    "function_score": {
      "functions": [
        { ... },  // other functions
        {
          "filter": {
            "range": {
              "price": {
                "lte": 50
              }
            }
          },
          "gauss": {
            "price": {
              "origin": 50,
              "offset": 0,
              "scale": 200
            }
          }
        }
      ]
    }
  }
}

Elasticsearch Query for good title keyword results

We have an Elasticsearch index containing a catalog of products that we want to search by title and description.
We want it to have the following constraints:
We are searching title and description for occurrences (matches in title should be twice as important as matches in description)
We want it to have a very light fuzzy search result (but still accurate results)
Results not matching the search term should not be filtered out, but only shown later (so matching results should be on top and worse results should be at the bottom)
category_id should filter products out (so no results of other categories should be shown)
The created_at attribute should be valued very high in sorting as well.
Products should lose score the "older" they get (this is very important, because they lose importance with every day).
I have tried to create a query like that, but the results are really less than accurate, sometimes finding completely unrelated stuff. I think that's because of the wildcard queries.
Also I think there must be a more elegant solution for the "created_at" scoring. Right?
I am using Elasticsearch 6.2
This is my current code. I would be happy to see an elegant solution for this:
{
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ],
  "min_score": 0.3,
  "size": 12,
  "from": 0,
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "category_id": [
            "212",
            "213"
          ]
        }
      },
      "should": [
        {
          "match": {
            "title_completion": {
              "query": "Development",
              "boost": 20
            }
          }
        },
        {
          "wildcard": {
            "title": {
              "value": "*Development*",
              "boost": 1
            }
          }
        },
        {
          "wildcard": {
            "title_completion": {
              "value": "*Development*",
              "boost": 10
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "Development",
              "operator": "and",
              "fuzziness": 1
            }
          }
        },
        {
          "range": {
            "created_at": {
              "gte": 1563264817998,
              "boost": 11
            }
          }
        },
        {
          "range": {
            "created_at": {
              "gte": 1563264040398,
              "boost": 4
            }
          }
        },
        {
          "range": {
            "created_at": {
              "gte": 1563256264398,
              "boost": 1
            }
          }
        }
      ]
    }
  }
}
First of all, building a request that returns relevant results is usually a difficult task. It can't be done without knowing the content of the documents. That said, I can give you hints to fulfill your requirements and avoid irrelevant results.
We are searching title and description for occurrences (matches in title should be twice as important as matches in description)
You can use boost as you did in your query to give more importance to matches on title compared to description.
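For example, a minimal sketch using a multi_match query with per-field boosts (the description field name is taken from your requirements; ^2 doubles the weight of title matches):
{
  "query": {
    "multi_match": {
      "query": "Development",
      "fields": ["title^2", "description"]
    }
  }
}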
We want it to have a very light fuzzy search result (but still accurate results)
You should use the AUTO value for the fuzziness parameter to get different levels of fuzziness depending on the length of the term. E.g., by default terms with fewer than 3 letters (the most common terms, where changing a single letter can produce a different word) will not allow any changes, terms of 3 to 5 letters will allow one change, and terms with more than 5 letters will allow 2 changes. You can tune this behavior depending on your tests.
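A minimal sketch of a match query using AUTO fuzziness (the misspelled term is only for illustration):
{
  "query": {
    "match": {
      "title": {
        "query": "Developent",
        "fuzziness": "AUTO"
      }
    }
  }
}
With AUTO, a 10-letter term like "Developent" is allowed up to two edits, so it still matches "Development" (one insertion away).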
Results not matching the search term should not be filtered out, but only shown later (so matching results should be on top and worse results should be at the bottom)
Use a should clause in the bool statement. Clauses in a should statement do not filter documents (unless specified otherwise); they are only used to increase the score.
category_id should filter products out (so no results of other categories should be shown)
Use a must or filter clause in the bool statement to ensure that all documents satisfy a constraint. If you don't want the subqueries to contribute to the score (I believe that's your case), use filter instead of must, because filter results can be cached. Your query is ok for this requirement.
The created_at attribute should be valued very high in sorting as well. Products should lose score the "older" they get (this is very important, because they lose importance with every day).
You should use a function score with a decay function. If decay functions are not clear to you, you can skip the equations in the documentation and jump to the figure, which is self-explanatory. The following query is an example using a gauss decay function.
{
  "function_score": {
    // Name of the decay function
    "gauss": {
      // Field to use
      "created_at": {
        "origin": "now", // "now" is the default so you can omit this field
        "offset": "1d",  // Values with less than 1 day will not be impacted
        "scale": "10d",  // Duration for which the scores will be scaled using a gauss function
        "decay": 0.01    // Score for values further than scale
      }
    }
  }
}
Hints for writing queries
Avoid wildcard queries: if you use * they are not efficient and will consume a lot of memory. If you want to be able to search within part of a term (finding "penthouse" when the user searches for "house") you should create a subfield using an ngram tokenizer and write a standard match query using the subfield (see the index sketch after these hints).
Avoid setting a minimum score: the score is a relative value. A small score or a high score does not, by itself, mean that the document is relevant or not. You can read this article about the subject.
Be careful with fuzzy queries: fuzziness can generate a lot of noise and confuse users. In general, I would recommend increasing the default AUTO thresholds for fuzziness and accepting that some queries with misspellings will not return good results. Usually, it is simpler for a user to spot a misspelling in their input than to understand why they get completely unrelated results.
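As referenced in the wildcard hint, a minimal index-creation body for such an ngram subfield (analyzer and field names are illustrative; on 6.x the properties sit under your mapping type, here _doc):
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "ngram_analyzer"
            }
          }
        }
      }
    }
  }
}
A standard match query on title.ngram can then replace the *Development* wildcard.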
Example query
This is just an example that you will need to adapt to your data.
{
  "size": 12,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": {
            "terms": {
              "category_id": <CATEGORY_IDS>
            }
          },
          "should": [
            {
              "match": {
                "title": {
                  "query": <QUERY>,
                  "fuzziness": "AUTO:4,12",
                  "boost": 3
                }
              }
            },
            {
              "match": {
                "title_completion": {
                  "query": <QUERY>,
                  "boost": 1
                }
              }
            },
            {
              "match": {
                // title_completion subfield with the ngram tokenizer
                "title_completion.ngram": {
                  "query": <QUERY>,
                  // Use a lower boost because it matches only partially
                  "boost": 0.5
                }
              }
            }
          ]
        }
      },
      // Name of the decay function
      "gauss": {
        // Field to use
        "created_at": {
          "origin": "now", // "now" is the default so you can omit this field
          "offset": "1d",  // Values with less than 1 day will not be impacted
          "scale": "10d",  // Duration for which the scores will be scaled using a gauss function
          "decay": 0.01    // Score for values further than scale
        }
      }
    }
  }
}

Elasticsearch - retrieving documents only, if multiple match by specific field

I have an index in Elasticsearch with users' posts. I want to retrieve the user_id from this index if, for a given date range, there are at least X posts; otherwise such users should be skipped.
Is there any way I can achieve this in ES, or do I have to get all entities and handle them later?
Trawa ;)
To answer your question I'll assume you have the fields user and datetime in your mapping.
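For reference, a minimal mapping sketch along those lines (the exact field types are assumptions; on older Elasticsearch versions the properties sit under the document type):
{
  "mappings": {
    "properties": {
      "user": {
        "type": "keyword"
      },
      "datetime": {
        "type": "date"
      }
    }
  }
}
A keyword type for user keeps the terms aggregation below cheap and exact.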
You can get the requested data like so:
Get the list of users who have more than X (e.g. X=100) posts in the given date range by aggregating on the user name for that date range:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "datetime": {
              "gte": "2017-05-01",
              "lt": "2017-06-01"
            }
          }
        }
      ]
    }
  },
  "aggregations": {
    "users": {
      "terms": {
        "field": "user",
        "min_doc_count": 100
      }
    }
  }
}
Edit the query to match your date range (and its format), and set min_doc_count to your minimum of X posts per user.
EDIT:
There is no way to avoid a terms aggregation if you need all distinct values.
50k values does seem to be too much data to retrieve at once, but it also depends on your cluster.
My suggestion is to add another filter, say an alphabetical one, so that instead of getting 50k results at once you can split the work over several queries:
"must": [
{
"range": {
"datetime": {
"gte": "2017-05-01",
"lt": "2017-06-01"
}
}
},
{
"wildcard": {
"user": "a*"
}
},
{
"wildcard": {
"user": "b*"
}
}
]
See Wildcard
Unfortunately, scrolling on aggregation results is not available. Manually dividing the data into pieces is the best thing I can see right now.

Elastic: refer to a calculated metric value inside the filter of another aggregation

I'm wondering if it is possible to refer to a computed metric value (I'm calculating the median of price in my documents) inside the filter of another aggregation.
Specifically, I know that I can calculate the median like this:
"aggs":{
"median": {
"percentiles" : {
"field" : "price",
"percents": [50]
}
},
...
}
But can I now refer to this value inside another aggregation, like this:
"aggs": {
"exact": {
"filter": {
"bool": {
"must": [
{
"range": {
"price": {
"gte": 1000,
"lte": median
}
}
}
]
}
}
},
...
}
Please let me know if I can provide any more details. I've been reading Elastic docs all day and it seems like I could do it with some combination of scripting and pipeline aggregations, but I haven't figured it out yet.
Thanks in advance.

Is there a way to have elasticsearch return a hit per generated bucket during an aggregation?

Right now I have a query like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"
          }
        },
        {
          "range": {
            "date": {
              "from": "now-12h",
              "to": "now"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "query": {
      "terms": {
        "field": "query",
        "size": 3
      }
    }
  }
}
The aggregation works perfectly well, but I can't seem to find a way to control the hit data that is returned. I can use the size parameter at the top of the DSL, but the hits that come back are not in the same order as the buckets, so the bucket results do not line up with the hit results. Is there any way to correct this, or do I have to issue 2 separate queries?
To expand on Filipe's answer, it seems like the top_hits aggregation is what you are looking for, e.g.
{
  "query": {
    ... snip ...
  },
  "aggs": {
    "query": {
      "terms": {
        "field": "query",
        "size": 3
      },
      "aggs": {
        "top": {
          "top_hits": {
            "size": 42
          }
        }
      }
    }
  }
}
Your query uses exact matches (match and range) and binary logic (must, bool) and thus should probably be converted to use filters instead:
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"
}
},
{
"range": {
"date": {
"from": "now-12h",
"to": "now"
}
}
}
]
}
}
As for the aggregations,
The hits that are returned do not represent all the buckets that were returned, so if I have buckets for terms 'a', 'b', and 'c' I want to have hits that represent those buckets as well.
Perhaps you are looking to control the scope of the buckets? You can make an aggregation bucket global so that it will not be influenced by the query or filter.
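A minimal sketch of a global aggregation, reusing the field names from your query (the all_documents bucket name is illustrative):
{
  "query": {
    "match": {
      "uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"
    }
  },
  "aggs": {
    "all_documents": {
      "global": {},
      "aggs": {
        "query": {
          "terms": {
            "field": "query",
            "size": 3
          }
        }
      }
    }
  }
}
The hits still come from the uuid match, while the terms buckets inside the global bucket are computed over every document in the index.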
Keep in mind that Elasticsearch will not "group" hits in any way -- it is always a flat list ordered according to score and additional sorting options.
Aggregations can be organized in a nested structure and return computed or extracted values, in a specific order. In the case of terms aggregation, it is in descending count (highest number of hits first). The hits section of the response is never influenced by your choice of aggregations. Similarly, you cannot find hits in the aggregation sections.
If your goal is to group documents by a certain field, yes, you will need to run multiple queries in the current Elasticsearch release.
I'm not 100% sure, but I think there's no way to do that in the current version of Elasticsearch (1.2.x). The good news is that there will be when version 1.3.x gets released:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
