ElasticSearch order by number of matches in nested fields - elasticsearch

Complete beginner here, quite possibly trying to do the impossible.
I have the following structure that I would like to store in Elasticsearch:
{
"id" : 1,
"code" : "03f3301c-4089-11e7-a919-92ebcb67fe33",
"countries" : [
{
"id" : 1,
"name" : "Netherlands"
},
{
"id" : 2,
"name" : "United Kingdom"
}
],
"tags" : [
{
"id" : 1,
"name" : "Scanned"
},
{
"id" : 2,
"name" : "Secured"
},
{
"id" : 3,
"name" : "Cleared"
}
]
}
I have complete control over how it will be stored, so the structure can change, but it should contain all these fields in some form.
I’d like to be able to query this data over countries and tags in such a way that all items having at least one match are returned, ordered by number of matches. If at all possible, I’d prefer not to do a full-text search.
For example:
id, code, country ids, tag ids
1, ..., [1, 2, 3], [1]
2, ..., [1], [1, 2, 3]
For the question "which of these was in country 1, or has tag 1, or has tag 2", the query should return:
2, ..., [1], [1, 2, 3]
1, ..., [1, 2, 3], [1]
In this order, because the second row matches more sub-queries in the above disjunction.
In essence, I’d like to replicate this SQL query:
SELECT p.id, p.code, COUNT(p.id) FROM packages p
LEFT JOIN tags t ON t.package_id = p.id
LEFT JOIN countries c ON c.package_id = p.id
WHERE t.id IN (1, 2, 3) OR c.id IN (1, 2, 3)
GROUP BY p.id
ORDER BY COUNT(p.id);
I’m using ElasticSearch 2.4.5 if that matters.
Hopefully I was clear enough. Thank you for your help!

You need countries and tags to be of type nested. You also need to take control of the scoring with function_score: give each filter inside function_score a weight of 1, set score_mode to sum so that the weights of all matching filters are added up, and set boost_mode to replace so that this sum replaces the score of the inner query. In the end you can use this query:
GET /nested/test/_search
{
"query": {
"function_score": {
"query": {
"match_all": {}
},
"functions": [
{
"filter": {
"nested": {
"path": "tags",
"query": {
"term": {
"tags.id": 1
}
}
}
},
"weight": 1
},
{
"filter": {
"nested": {
"path": "tags",
"query": {
"term": {
"tags.id": 2
}
}
}
},
"weight": 1
},
{
"filter": {
"nested": {
"path": "countries",
"query": {
"term": {
"countries.id": 1
}
}
}
},
"weight": 1
}
],
"boost_mode": "replace",
"score_mode": "sum"
}
}
}
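The query above repeats one filter function per id, so it is natural to generate it. Below is a minimal Python sketch of that idea; the helper names nested_term_function and build_match_count_query are hypothetical (they are not part of any Elasticsearch client), and the code only builds the request body as a dictionary, leaving the HTTP call to whatever client you use.

```python
import json


def nested_term_function(path, value):
    """One function_score entry: weight 1 when a nested object under `path` has id == value."""
    return {
        "filter": {
            "nested": {
                "path": path,
                "query": {"term": {f"{path}.id": value}},
            }
        },
        "weight": 1,
    }


def build_match_count_query(tag_ids, country_ids):
    """Build the function_score body (hypothetical helper for illustration)."""
    functions = [nested_term_function("tags", t) for t in tag_ids]
    functions += [nested_term_function("countries", c) for c in country_ids]
    return {
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "functions": functions,
                "boost_mode": "replace",  # use the function result instead of the query score
                "score_mode": "sum",      # add up the weights of all matching filters
            }
        }
    }


# Same ids as the hand-written query: tags 1 and 2, country 1.
body = build_match_count_query(tag_ids=[1, 2], country_ids=[1])
print(json.dumps(body, indent=2))
```

The printed JSON should have the same shape as the hand-written request, with one function per requested id.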
For a more complete test case, I am also providing the mapping and test data:
PUT nested
{
"mappings": {
"test": {
"properties": {
"tags": {
"type": "nested",
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
}
}
},
"countries": {
"type": "nested",
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
POST nested/test/_bulk
{"index":{"_id":1}}
{"name":"Foo Bar","tags":[{"id":2,"name":"My Tag 5"},{"id":3,"name":"My Tag 7"}],"countries":[{"id":1,"name":"USA"}]}
{"index":{"_id":2}}
{"name":"Foo Bar","tags":[{"id":3,"name":"My Tag 6"}],"countries":[{"id":1,"name":"USA"},{"id":2,"name":"UK"},{"id":3,"name":"UAE"}]}
{"index":{"_id":3}}
{"name":"Foo Bar","tags":[{"id":1,"name":"My Tag 4"},{"id":3,"name":"My Tag 1"}],"countries":[{"id":3,"name":"UAE"}]}
{"index":{"_id":4}}
{"name":"Foo Bar","tags":[{"id":1,"name":"My Tag 1"},{"id":2,"name":"My Tag 4"},{"id":3,"name":"My Tag 2"}],"countries":[{"id":2,"name":"UK"},{"id":3,"name":"UAE"}]}
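To see what ordering to expect from this test data, here is a rough Python simulation of the scoring (it is not Elasticsearch itself): each document gets one point per matching filter, mirroring weight 1 with score_mode sum. The names DOCS, FILTERS and match_count are made up for this sketch.

```python
# Test documents reduced to the ids that matter for scoring.
DOCS = {
    1: {"tags": [2, 3], "countries": [1]},
    2: {"tags": [3], "countries": [1, 2, 3]},
    3: {"tags": [1, 3], "countries": [3]},
    4: {"tags": [1, 2, 3], "countries": [2, 3]},
}

# Same three filters as the query: tag 1, tag 2, country 1.
FILTERS = [("tags", 1), ("tags", 2), ("countries", 1)]


def match_count(doc):
    """Number of filters the document satisfies (sum of weights of 1)."""
    return sum(1 for field, wanted in FILTERS if wanted in doc[field])


scores = {doc_id: match_count(doc) for doc_id, doc in DOCS.items()}
# Highest score first; ties are broken by id here only to keep the output stable.
ranking = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
print(scores)
print(ranking)
```

Documents 1 and 4 match two filters each and documents 2 and 3 match one, so 1 and 4 should rank ahead of 2 and 3. The order within a tie is not determined by the query. Every test document happens to match at least one filter; because the inner query is match_all, a document matching none of them would still be returned by the real query.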

Related

Sorting by a nested field in elasticsearch

If I had a data structure that looked like this
[{"_id": 1,
"scores": [{"student_id": 1, "score": 100}, {"student_id": 2, "score": 80}
]},
{"_id": 2,
"scores": [{"student_id": 1, "score": 20}, {"student_id": 2, "score": 90}
]}]
Would it be possible to sort this dataset by student_1's score or by student_2's score?
For example if I sorted descending by student 1's score, I would get document 1,2, but if I sorted descending by student 2's score, I would get 2,1.
I could re-arrange the data, but I don't want to use another index because there's a bunch of metadata not included above for brevity. Thanks!
Yes, it is possible. You must use the "nested" field type for your scores; that way you can keep the relation between each student_id and its score.
You can read an article I wrote about that subject:
https://opster.com/guides/elasticsearch/data-architecture/elasticsearch-nested-field-object-field/
Now the example:
Mappings
PUT test_students
{
"mappings": {
"properties": {
"scores": {
"type": "nested",
"properties": {
"student_id": {
"type": "keyword"
},
"score": {
"type": "long"
}
}
}
}
}
}
Documents
PUT test_students/_doc/1
{
"scores": [{"student_id": 1, "score": 100}, {"student_id": 2, "score": 80}]
}
PUT test_students/_doc/2
{
"scores": [{"student_id": 1, "score": 20}, {"student_id": 2, "score": 90}]
}
Query
POST test_students/_search
{
"sort" : [
{
"scores.score" : {
"mode" : "max",
"order" : "desc",
"nested": {
"path": "scores",
"filter": {
"term" : { "scores.student_id" : "2" }
}
}
}
}
]
}
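As a rough illustration of what the nested sort does, here is a small Python simulation (not Elasticsearch code): for each document it keeps only the nested objects that pass the filter, takes the maximum score, and orders the documents descending. The function name sort_by_student is hypothetical, and sending documents without a matching student to the end is an assumption of this sketch.

```python
DOCS = [
    {"_id": 1, "scores": [{"student_id": "1", "score": 100}, {"student_id": "2", "score": 80}]},
    {"_id": 2, "scores": [{"student_id": "1", "score": 20}, {"student_id": "2", "score": 90}]},
]


def sort_by_student(docs, student_id):
    """Order document ids by the given student's highest score, descending."""

    def sort_value(doc):
        # Keep only the nested objects that pass the filter, then apply mode "max".
        matching = [s["score"] for s in doc["scores"] if s["student_id"] == student_id]
        # Assumption: documents without a matching nested object go last.
        return max(matching) if matching else float("-inf")

    return [d["_id"] for d in sorted(docs, key=sort_value, reverse=True)]


print(sort_by_student(DOCS, "1"))
print(sort_by_student(DOCS, "2"))
```

Sorting by student 1 puts document 1 first, and sorting by student 2 puts document 2 first, which is the behaviour asked for in the question.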

How to filter query based on a field value

I'm working with elasticsearch Query dsl, and I can't find a way to express the following:
Return results that have has_price = true and "price" > min budget and "price" < max budget, and also return all results that have has_price = false.
In other words, I would like to apply a range filter only to results that have the has_price field set to true; results that have has_price set to false should be returned regardless of the filter.
Here's the mapping:
{
"formations": {
"mappings": {
"properties": {
"code": {
"type": "text"
},
"date": {
"type": "date",
"format": "dd/MM/yyyy"
},
"description": {
"type": "text"
},
"has_price": {
"type": "boolean"
},
"place": {
"type": "text"
},
"price": {
"type": "float"
},
"title": {
"type": "text"
}
}
}
}
}
The following query combines the two scenarios as two should clauses in a bool query. Because there are only should clauses, minimum_should_match defaults to 1, meaning that at least one should clause has to match:
Abstract Code Snippet
GET formations/_search
{
"query": {
"bool": {
"should": [
{ <1st scenario: has_price = false> },
{ <2nd scenario: has_price = true AND price IN budget_range> }
]
}
}
}
Actual Sample Code Snippets
# 1. Create the index and populate it with some sample documents
POST formations/_bulk
{"index": {"_id": 1}}
{"has_price": true, "price": 2.0}
{"index": {"_id": 2}}
{"has_price": true, "price": 3.0}
{"index": {"_id": 3}}
{"has_price": true, "price": 4.0}
{"index": {"_id": 4}}
{"has_price": false, "price": 2.0}
{"index": {"_id": 5}}
{"has_price": false, "price": 3.0}
{"index": {"_id": 6}}
{"has_price": false, "price": 4.0}
# 2. Query assuming min_budget = 2.0 and max_budget = 4.0
GET formations/_search
{
"query": {
"bool": {
"should": [
{
"bool": {
"filter": {
"term": {
"has_price": false
}
}
}
},
{
"bool": {
"filter": [
{
"term": {
"has_price": true
}
},
{
"range": {
"price": {
"gt": 2,
"lt": 4
}
}
}
]
}
}
]
}
}
}
# 3. Result Snippet (4 hits: 3 from 1st scenario & 1 from 2nd scenario)
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
...
Don't forget to add the clause "minimum_should_match": 1 to your bool query if you add a non-should clause (such as must or filter) to it, since the should clauses are then no longer required to match by default.
Let me know if this answers your question & solves your issue.
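The logic of the two should clauses can be checked against the sample documents with a short Python sketch. This is only a simulation of the boolean condition, not a call to Elasticsearch, and the function name matches is made up for the example.

```python
DOCS = [
    {"_id": 1, "has_price": True, "price": 2.0},
    {"_id": 2, "has_price": True, "price": 3.0},
    {"_id": 3, "has_price": True, "price": 4.0},
    {"_id": 4, "has_price": False, "price": 2.0},
    {"_id": 5, "has_price": False, "price": 3.0},
    {"_id": 6, "has_price": False, "price": 4.0},
]


def matches(doc, min_budget, max_budget):
    """True when at least one of the two 'should' scenarios holds."""
    # 1st scenario: has_price = false, price is ignored.
    no_price = not doc["has_price"]
    # 2nd scenario: has_price = true AND min_budget < price < max_budget (gt / lt).
    in_budget = doc["has_price"] and min_budget < doc["price"] < max_budget
    return no_price or in_budget


hits = [d["_id"] for d in DOCS if matches(d, min_budget=2.0, max_budget=4.0)]
print(hits)
```

Because the range uses gt and lt, the documents priced exactly 2.0 and 4.0 with has_price = true are excluded, which is why only one hit comes from the second scenario.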

ElasticSearch - Fuzzy and strict match with multiple fields

We want to leverage ElasticSearch to find us similar objects.
Let's say I have an object with 4 fields:
product_name, seller_name, seller_phone, platform_id.
Similar products can have different product names and seller names across different platforms (fuzzy match).
The phone number, on the other hand, is strict: a single variation might yield a wrong record (strict match).
What we're trying to create is a query that will:
Take into account all fields we have for the current record and OR between them.
Require that platform_id is the one I specifically want to look at. (AND)
Fuzzy match the product_name and seller_name.
Strictly match the phone number or ignore it in the OR between the fields.
If I would write it in pseudo code, I would write something like:
((product_name like 'some_product_name') OR (seller_name like 'some_seller_name') OR (seller_phone = 'some_phone')) AND (platform_id = 123)
To do an exact match on seller_phone, I am indexing this field without n-gram analyzers, and I am using fuzzy queries for product_name and seller_name.
Mapping
PUT index111
{
"settings": {
"analysis": {
"analyzer": {
"edge_n_gram_analyzer": {
"tokenizer": "whitespace",
"filter" : ["lowercase", "edge_gram_filter"]
}
},
"filter": {
"edge_gram_filter" : {
"type" : "NGram",
"min_gram" : 2,
"max_gram": 10
}
}
}
},
"mappings": {
"document_type" : {
"properties": {
"product_name" : {
"type": "text",
"analyzer": "edge_n_gram_analyzer"
},
"seller_name" : {
"type": "text",
"analyzer": "edge_n_gram_analyzer"
},
"seller_phone" : {
"type": "text"
},
"platform_id" : {
"type": "text"
}
}
}
}
}
Index documents
POST index111/document_type
{
"product_name":"macbok",
"seller_name":"apple",
"seller_phone":"9988",
"platform_id":"123"
}
For the following pseudo-SQL query
((product_name like 'some_product_name') OR (seller_name like 'some_seller_name') OR (seller_phone = 'some_phone')) AND (platform_id = 123)
Elastic Query
POST index111/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"platform_id": {
"value": "123"
}
}
},
{
"bool": {
"should": [{
"fuzzy": {
"product_name": {
"value": "macbouk",
"boost": 1.0,
"fuzziness": 2,
"prefix_length": 0,
"max_expansions": 100
}
}
},
{
"fuzzy": {
"seller_name": {
"value": "apdle",
"boost": 1.0,
"fuzziness": 2,
"prefix_length": 0,
"max_expansions": 100
}
}
},
{
"term": {
"seller_phone": {
"value": "9988"
}
}
}
]
}
}]
}
}
}
Hope this helps
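To make the intended matching rule concrete, here is a minimal Python sketch of the pseudo-SQL condition. It is a simplification and not how Elasticsearch evaluates the query: it compares whole field values using a plain Levenshtein distance, whereas the mapping above indexes n-gram terms. The names edit_distance and is_similar are hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def is_similar(record, candidate, max_edits=2):
    """(fuzzy product_name OR fuzzy seller_name OR exact seller_phone) AND same platform_id."""
    if record["platform_id"] != candidate["platform_id"]:
        return False
    return (
        edit_distance(record["product_name"], candidate["product_name"]) <= max_edits
        or edit_distance(record["seller_name"], candidate["seller_name"]) <= max_edits
        or record["seller_phone"] == candidate["seller_phone"]
    )


indexed = {"product_name": "macbok", "seller_name": "apple",
           "seller_phone": "9988", "platform_id": "123"}
query = {"product_name": "macbouk", "seller_name": "apdle",
         "seller_phone": "9988", "platform_id": "123"}

print(edit_distance("macbouk", "macbok"))
print(is_similar(query, indexed))
```

Both misspelled values in the query are within the allowed two edits of the indexed values, and the platform_id check acts as the mandatory AND.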

Exclude documents from aggregation

I am trying to get a filtered result set from my index.
{"group_id": 123, "type" : 1},
{"group_id": 123, "type" : 3},
{"group_id": 123, "type" : 2},
{"group_id": 423, "type" : 3},
{"group_id": 423, "type" : 1},
{"group_id": 231, "type" : 1}
Now I want to get all documents but exclude those whose group_id has at least one document with type = 2. So, in this case, I want to get all documents with group_id = 423 and group_id = 231, but exclude all documents with group_id = 123.
I was experimenting with filtered bool query:
{
"query": {
"bool": {
"must_not": [
{
"term": {
"type": 2
}
}
]
}
}
}
but that only excludes one document.
Any hints are welcome!
You can achieve this using two Elasticsearch search requests:
First, get all values of "group_id" for which corresponding value of "type" is 2. You need to use Terms Aggregation for this.
POST <index name>/<type name>/_search
{
"size": 0,
"query": {
"filtered": {
"filter": {
"term": {
"type": 2
}
}
}
},
"aggs": {
"group_ids_type_2": {
"terms": {
"field": "group_id",
"size": 0
}
}
}
}
Save the list of values of "group_id" fields received from the above request.
Now, use a query with a must_not clause to get all documents whose "group_id" is not present in the list obtained above. You need to use the Terms Filter here. In the request below, replace "123" with a comma-separated list of all group_id values received from the first search request.
POST <index name>/<type name>/_search
{
"query": {
"bool": {
"must_not": [
{
"terms": {
"group_id": [
"123"
]
}
}
]
}
}
}
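The two-step approach can be sketched in plain Python on the sample data from the question. This only mimics the logic of the two requests (collect the offending group_id values, then exclude them); it does not talk to Elasticsearch, and the variable names are made up for the example.

```python
DOCS = [
    {"group_id": 123, "type": 1},
    {"group_id": 123, "type": 3},
    {"group_id": 123, "type": 2},
    {"group_id": 423, "type": 3},
    {"group_id": 423, "type": 1},
    {"group_id": 231, "type": 1},
]

# Step 1 (terms aggregation): every group_id that has at least one document with type == 2.
excluded_groups = {d["group_id"] for d in DOCS if d["type"] == 2}

# Step 2 (must_not + terms): keep only documents whose group_id is not in that set.
remaining = [d for d in DOCS if d["group_id"] not in excluded_groups]

print(excluded_groups)
print(remaining)
```

Only the group containing a type 2 document is dropped as a whole, which a single must_not on type cannot achieve because it evaluates each document on its own.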

Elasticsearch - bump individual result to the top

I'm working with Elasticsearch. I have an array of documents, and I'm trying to sort documents by the property price, except that I'd like a particular document to be the first result no matter what.
Below is what I'm using as my "sort" array in an attempt to put the document with ID 1213 first, followed by all other documents ordered by price ascending.
[
{
"id": {
"mode": "max",
"order": "desc",
"nested_filter": {
"term": {
"id": 1213
}
},
"missing": "_last"
}
},
{
"price": {
"order": "asc"
}
}
]
This doesn't appear to be working, though—document 1213 doesn't appear first. What am I doing wrong here?
As an example—the ideal returned result:
[{"id": 1213, "name": "Blue Sunglasses", "price": 12},
{"id": 1000, "name": "Green Sunglasses", "price": 2},
{"id": 1031, "name": "Purple Sunglasses", "price": 4},
{"id": 5923, "name": "Yellow Sunglasses", "price": 18}]
Instead, I get:
[{"id": 1000, "name": "Green Sunglasses", "price": 2},
{"id": 1031, "name": "Purple Sunglasses", "price": 4},
{"id": 1213, "name": "Blue Sunglasses", "price": 12},
{"id": 5923, "name": "Yellow Sunglasses", "price": 18}]
As others have already asked, what is the reason for the nested_filter?
There are many possible ways to do what you need. Here is one that fits the simple requirements you have mentioned so far:
{
"query" : {
"custom_filters_score" : {
"query" : {
"match_all" : {}
},
"filters" : [
{
"filter" : {
"term" : {
"id" : "1213"
}
},
"boost" : 2
}
]
}
},
"sort" : [
"_score",
"price"
]
}
The assumption here is that your query is simple, like the match_all query, and does not affect the scores in any way. If you have a more complicated query, you can try wrapping it in a constant_score query so that it does not affect the scores. Ideally, you get the document set you want with all documents having the same score, and then the custom_filters_score query boosts the score of the document you want. You can do this for any number of documents by adding further filters or, if the documents should be boosted equally, by using a terms filter. In the end, sort by the score and then by the price.
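The effect of sorting by score and then by price can be illustrated with a small Python sketch. It assumes every document starts with the same base score and that the matching filter multiplies it by the boost of 2; the names PINNED_ID and score are invented for the example, and nothing here calls Elasticsearch.

```python
DOCS = [
    {"id": 1000, "name": "Green Sunglasses", "price": 2},
    {"id": 1031, "name": "Purple Sunglasses", "price": 4},
    {"id": 1213, "name": "Blue Sunglasses", "price": 12},
    {"id": 5923, "name": "Yellow Sunglasses", "price": 18},
]

PINNED_ID = 1213


def score(doc):
    """Constant base score for everyone; the pinned document gets the boost of 2."""
    base = 1.0  # assumption: the inner query scores all documents equally
    return base * 2 if doc["id"] == PINNED_ID else base


# Sort by score descending first, then by price ascending as the tie-breaker.
ordered = sorted(DOCS, key=lambda d: (-score(d), d["price"]))
print([d["id"] for d in ordered])
```

Since only the pinned document has a higher score, it comes first and all remaining documents tie on score, so the price decides their order.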
In this case you need to use function_score to modify the score of each document.
{
"query": {
"function_score": {
"functions": [
{
"filter": {
"term": {
"id": "1213"
}
},
"weight": 1
},
{
"script_score": {
"script": "(1.0 / doc['price'].value)"
}
}
],
"score_mode": "sum",
"boost_mode" : "replace",
"query" : {
//YOUR QUERY GOES HERE
}
}
}
}
Explanation:
{
"script_score": {
"script": "(1.0 / doc['price'].value)"
}
}
This computes a score from the price; the result is below 1 as long as the price is greater than 1. The higher the price, the smaller the score, so results come out in ascending price order. If you want to switch to descending, just replace it with
"script": "(1 - (1.0 / doc['price'].value))"
{
"filter": {
"term": {
"id": "1213"
}
},
"weight": 1
}
This gives any document with "id" = 1213 an extra score of 1. The total score at the end is the sum of those two functions.
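The arithmetic behind the two functions can be checked with a short Python sketch using the documents from the question. It only reproduces the sum of the weight and the inverse price; the function name score and its pinned_id parameter are hypothetical, and it is not Elasticsearch code.

```python
DOCS = [
    {"id": 1000, "name": "Green Sunglasses", "price": 2},
    {"id": 1031, "name": "Purple Sunglasses", "price": 4},
    {"id": 1213, "name": "Blue Sunglasses", "price": 12},
    {"id": 5923, "name": "Yellow Sunglasses", "price": 18},
]


def score(doc, pinned_id=1213):
    """score_mode 'sum': weight 1 for the pinned id plus the inverse of the price."""
    weight = 1.0 if doc["id"] == pinned_id else 0.0
    return weight + 1.0 / doc["price"]


# Results are returned by score, highest first.
ordered = sorted(DOCS, key=score, reverse=True)
for doc in ordered:
    print(doc["id"], round(score(doc), 4))
```

The pinned document ends up with a score just above 1, and every other document stays below 1 because its price is greater than 1. A document priced at 1 or less would reach a score of 1 or more from the price alone, so this trick relies on all prices being above 1.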

Resources