Elasticsearch score from 0 to 1 when searching for documents similar to an existing one

I need to calculate a relative score from 0 to 1 when searching for documents similar to an existing one.
The existing document would have a score of 1, and every other matching document would be scored relative to it, so its score is <= 1. The existing document itself should be excluded from the search results. Is it possible to do this on the Elasticsearch side, rather than calculating the score manually in a programming language like this:
match_doc_score/search_doc_score
Let's imagine we have index person with mapping:
{
"properties": {
"person_id": {
"type": "keyword"
},
"fullname": {
"type": "text"
},
"email": {
"type": "keyword"
},
"phone": {
"type": "keyword"
},
"country_of_birth": {
"type": "keyword"
}
}
}
And I have 3 persons inside the index:
Person 1:
{
"person_id": 1,
"fullname": "John Snow",
"email": "john#gmail.com",
"phone": "111-11-11",
"country_of_birth": "Denmark"
}
Person 2:
{
"person_id": 2,
"fullname": "Snow John",
"email": "john#gmail.com",
"phone": "222-22-22",
"country_of_birth": "Denmark"
}
Person 3:
{
"person_id": 3,
"fullname": "Peter Wislow",
"email": "peter#gmail.com",
"phone": "111-11-11",
"country_of_birth": "Denmark"
}
We find persons that are similar to Person 1 by this query:
{
"query": {
"bool": {
"should": [
{
"match": {
"fullname": {
"query": "John Snow",
"boost": 6
}
}
},
{
"term": {
"email": {
"value": "john#gmail.com",
"boost": 5
}
}
},
{
"term": {
"phone": {
"value": "111-11-11",
"boost": 4
}
}
},
{
"term": {
"country_of_birth": {
"value": "Denmark",
"boost": 2
}
}
}
],
"must_not": [
{
"term": {
"person_id": 1
}
}
]
}
}
}
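As a rough illustration, the same query body could be generated from the reference document so that the excluded person_id always matches the document being compared. This is only a sketch: the helper name build_similarity_query and the boosts dictionary are made up for this example and are not part of any Elasticsearch client.

```python
def build_similarity_query(person, boosts):
    """Build a bool query that scores documents similar to `person`.

    `boosts` maps a field name to its boost. `fullname` is an analyzed
    text field, so it uses `match`; the keyword fields use `term`.
    """
    should = []
    for field, boost in boosts.items():
        if field == "fullname":
            should.append({"match": {field: {"query": person[field], "boost": boost}}})
        else:
            should.append({"term": {field: {"value": person[field], "boost": boost}}})
    return {
        "query": {
            "bool": {
                "should": should,
                # Exclude the reference document itself from the results.
                "must_not": [{"term": {"person_id": person["person_id"]}}],
            }
        }
    }


person_1 = {
    "person_id": 1,
    "fullname": "John Snow",
    "email": "john#gmail.com",
    "phone": "111-11-11",
    "country_of_birth": "Denmark",
}
query = build_similarity_query(
    person_1,
    {"fullname": 6, "email": 5, "phone": 4, "country_of_birth": 2},
)
```

The resulting dictionary has the same shape as the JSON body above and can be serialized with json.dumps before sending it to the _search endpoint.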
As you can see:
person 1 and person 2 match by: fullname, email, country of birth.
person 1 and person 3 match by: phone, country of birth.
Is it possible to get 0..1 scoring, given that the fully matching document (person 1) exists in the index?
I know there is a more_like_this query, but real-life search queries can be complicated, so more_like_this is not a good option. Even the Elasticsearch documentation says that if you need more control over the query, you should use boolean query combinations.

I have not tried it, but it looks like the field_value_factor function of a function_score query might solve your problem.
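For comparison, the client-side fallback from the question (match_doc_score/search_doc_score) might look like the sketch below. It assumes the reference score is obtained separately, for instance by running the same query without the must_not clause and reading the score of the reference document. The function name and the score values are invented for illustration and are not real Elasticsearch output.

```python
def normalize_scores(hits, reference_score):
    """Scale each hit's _score into 0..1 relative to the reference score."""
    if reference_score <= 0:
        raise ValueError("reference_score must be positive")
    return [
        {
            "_id": hit["_id"],
            # Clamp at 1.0 in case a hit scores higher than the reference.
            "relative_score": min(hit["_score"] / reference_score, 1.0),
        }
        for hit in hits
    ]


# Hypothetical scores: 10.0 for the reference document (person 1),
# 7.5 and 2.5 for the two other persons.
hits = [{"_id": "2", "_score": 7.5}, {"_id": "3", "_score": 2.5}]
relative = normalize_scores(hits, reference_score=10.0)
```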

Related

Group by terms and get count of nested array property?

I would like to get the count from a document series where an array item matches some value.
I have documents like these:
{
"Name": "jason",
"Todos": [{
"State": "COMPLETED",
"Timer": 10
},{
"State": "PENDING",
"Timer": 5
}]
}
{
"Name": "jason",
"Todos": [{
"State": "COMPLETED",
"Timer": 5
},{
"State": "PENDING",
"Timer": 2
}]
}
{
"Name": "martin",
"Todos": [{
"State": "COMPLETED",
"Timer": 15
},{
"State": "PENDING",
"Timer": 10
}]
}
I would like to count how many documents have at least one Todo with State COMPLETED, grouped by Name.
So from the above I would need to get:
jason: 2
martin: 1
Usually I do this with a terms aggregation on Name and another sub-aggregation for the other items:
"aggs": {
"statistics": {
"terms": {
"field": "Name"
},
"aggs": {
"test": {
"filter": {
"bool": {
"must": [{
"match_phrase": {
"SomeProperty.keyword": {
"query": "THEVALUE"
}
}
}
]
}
},
But I am not sure how to do this here, since the items are in an array.
Elasticsearch has no problem with arrays because in fact it flattens them by default:
Arrays of inner object fields do not work the way you may expect. Lucene has no concept of inner objects, so Elasticsearch flattens object hierarchies into a simple list of field names and values.
So a query like the one you posted will do. I would use term query for keyword datatype, though:
POST mytodos/_search
{
"size": 0,
"aggs": {
"by name": {
"terms": {
"field": "Name"
},
"aggs": {
"how many completed": {
"filter": {
"term": {
"Todos.State": "COMPLETED"
}
}
}
}
}
}
}
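To check what the aggregation should return, the same count can be reproduced in plain Python over the three sample documents from the question. This is only a local cross-check of the expected buckets, not a call to Elasticsearch; the function name is made up.

```python
from collections import Counter

docs = [
    {"Name": "jason", "Todos": [{"State": "COMPLETED", "Timer": 10},
                                {"State": "PENDING", "Timer": 5}]},
    {"Name": "jason", "Todos": [{"State": "COMPLETED", "Timer": 5},
                                {"State": "PENDING", "Timer": 2}]},
    {"Name": "martin", "Todos": [{"State": "COMPLETED", "Timer": 15},
                                 {"State": "PENDING", "Timer": 10}]},
]


def count_completed_by_name(documents):
    """Count documents per Name that have at least one COMPLETED todo."""
    counts = Counter()
    for doc in documents:
        if any(todo["State"] == "COMPLETED" for todo in doc["Todos"]):
            counts[doc["Name"]] += 1
    return dict(counts)


result = count_completed_by_name(docs)
print(result)
```

Each document is counted at most once, which mirrors the doc_count of the filter sub-aggregation inside each Name bucket.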
I am assuming your mapping looks something like this:
PUT mytodos/_mappings
{
"properties": {
"Name": {
"type": "keyword"
},
"Todos": {
"properties": {
"State": {
"type": "keyword"
},
"Timer": {
"type": "integer"
}
}
}
}
}
The example documents that you posted will be transformed internally into something like this:
{
"Name": "jason",
"Todos.State": ["COMPLETED", "PENDING"],
"Todos.Timer": [10, 5]
}
However, if you need to query on Todos.State and Todos.Timer together, for example to filter for todos that are "COMPLETED" and also have Timer > 10, it will not be possible with this mapping, because Elasticsearch loses the link between the fields of each item in an object array.
In that case you would need to use the nested datatype for such arrays and query them with the nested query.
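The loss of that link can be imitated in a few lines of Python. The sketch below flattens a document the way described above and compares a "flattened" match with a per-item match. The document for "anna" is hypothetical and was added only to trigger the false positive; the function names are also made up.

```python
def flatten(doc):
    """Imitate object flattening: one multi-valued field per leaf."""
    return {
        "Name": doc["Name"],
        "Todos.State": [todo["State"] for todo in doc["Todos"]],
        "Todos.Timer": [todo["Timer"] for todo in doc["Todos"]],
    }


def matches_flattened(flat, state, min_timer):
    # Each condition is checked against the whole value list independently.
    return state in flat["Todos.State"] and any(
        timer > min_timer for timer in flat["Todos.Timer"]
    )


def matches_per_item(doc, state, min_timer):
    # Both conditions must hold for the same todo item (nested behaviour).
    return any(
        todo["State"] == state and todo["Timer"] > min_timer
        for todo in doc["Todos"]
    )


# Hypothetical document: the COMPLETED todo has Timer 5, the PENDING one 20.
anna = {"Name": "anna", "Todos": [{"State": "COMPLETED", "Timer": 5},
                                  {"State": "PENDING", "Timer": 20}]}

false_positive = matches_flattened(flatten(anna), "COMPLETED", 10)
correct = matches_per_item(anna, "COMPLETED", 10)
```

Here false_positive is True even though no single todo is both COMPLETED and has Timer > 10, while the per-item check returns False.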
Hope that helps!

Elasticsearch query fails to return results when querying a nested object

I have an object which looks something like this:
{
"id": 123,
"language_id": 1,
"label": "Pablo de la Pena",
"office": {
"count": 2,
"data": [
{
"id": 1234,
"is_office_lead": false,
"office": {
"id": 1,
"address_line_1": "123 Main Street",
"address_line_2": "London",
"address_line_3": "",
"address_line_4": "UK",
"address_postcode": "E1 2BC",
"city_id": 1
}
},
{
"id": 5678,
"is_office_lead": false,
"office": {
"id": 2,
"address_line_1": "77 High Road",
"address_line_2": "Edinburgh",
"address_line_3": "",
"address_line_4": "UK",
"address_postcode": "EH1 2DE",
"city_id": 2
}
}
]
},
"primary_office": {
"id": 1,
"address_line_1": "123 Main Street",
"address_line_2": "London",
"address_line_3": "",
"address_line_4": "UK",
"address_postcode": "E1 2BC",
"city_id": 1
}
}
My Elasticsearch mapping looks like this:
"mappings": {
"item": {
"properties": {
"office": {
"properties": {
"data": {
"type": "nested"
}
}
}
}
}
}
My Elasticsearch query looks something like this:
GET consultant/item/_search
{
"from": 0,
"size": 24,
"query": {
"bool": {
"must": [
{
"term": {
"language_id": 1
}
},
{
"term": {
"office.data.office.city_id": 1
}
}
]
}
}
}
This returns zero results, however, if I remove the second term and leave it only with the language_id clause, then it works as expected.
I'm sure this is down to a misunderstanding on my part of how the nested object is flattened, but I'm out of ideas - I've tried all kinds of permutations of the query and mappings.
Any guidance hugely appreciated. I am using Elasticsearch 6.1.1.
I'm not sure whether you need the entire record or not; this solution returns every record that has language_id: 1 and an office.data.office.city_id: 1 value.
GET consultant/item/_search
{
"from": 0,
"size": 100,
"query": {
"bool":{
"must": [
{
"term": {
"language_id": {
"value": 1
}
}
},
{
"nested": {
"path": "office.data",
"query": {
"match": {
"office.data.office.city_id": 1
}
}
}
}
]
}
}
}
I put 3 different records in my test index to guard against false hits: the matching one, one with a different language_id, and one with different office ids. Only the matching one was returned.
If you only need the office data, then that's a bit different but still solvable.
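The reason the plain term clause finds nothing is that fields under a nested path are indexed as separate hidden documents, so they have to be reached through a nested query on that path. A small helper that wraps a clause this way might look like the sketch below. The helper name is hypothetical, and it uses term instead of the match clause above, which is an assumption that should behave the same for a numeric city_id.

```python
def nested_term_query(path, field, value):
    """Wrap a term clause on `path.field` in a nested query for `path`."""
    return {
        "nested": {
            "path": path,
            # The field name inside the nested query must be the full path.
            "query": {"term": {f"{path}.{field}": value}},
        }
    }


query = {
    "from": 0,
    "size": 24,
    "query": {
        "bool": {
            "must": [
                {"term": {"language_id": 1}},
                nested_term_query("office.data", "office.city_id", 1),
            ]
        }
    },
}
```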

Extract record from multiple arrays based on a filter

I have documents in Elasticsearch with the following structure:
"_source": {
"last_updated": "2017-10-25T18:33:51.434706",
"country": "Italia",
"price": [
"€ 139",
"€ 125",
"€ 120",
"€ 108"
],
"max_occupancy": [
2,
2,
1,
1
],
"type": [
"Type 1",
"Type 1 - (Tag)",
"Type 2",
"Type 2 (Tag)"
],
"availability": [
10,
10,
10,
10
],
"size": [
"26 m²",
"35 m²",
"47 m²",
"31 m²"
]
}
Basically, the detail records are split across 5 arrays (price, max_occupancy, type, availability, size), and the fields of the same record share the same index position in all 5 arrays. I want to extract the record whose max_occupancy is greater than or equal to 2 (if there is no record with 2, take one with 3; if there is none with 3, take one with 4, and so on) and which has the lowest price, and place the result into a new JSON object like the following:
{
"last_updated": "2017-10-25T18:33:51.434706",
"country": "Italia",
"price": "€ 125",
"max_occupancy": "2",
"type": "Type 1 - (Tag)",
"availability": 10,
"size": "35 m²"
}
Basically the result should contain the extracted record (in this case the second element of each array) plus the general information (the fields "last_updated" and "country").
Is it possible to extract such a result from elastic search? What kind of query do I need to perform?
Could someone suggest the best approach?
My best approach: go nested with the Nested Datatype.
Besides making querying easier, it is easier to read and understand the connections between those objects, which are currently scattered across different arrays.
Yes, if you decide on this approach you will have to edit your mapping and re-index all of your data.
What would the mapping look like? Something like this:
{
"mappings": {
"properties": {
"last_updated": {
"type": "date"
},
"country": {
"type": "string"
},
"records": {
"type": "nested",
"properties": {
"price": {
"type": "string"
},
"max_occupancy": {
"type": "long"
},
"type": {
"type": "string"
},
"availability": {
"type": "long"
},
"size": {
"type": "string"
}
}
}
}
}
}
EDIT: New document structure (containing nested documents) -
{
"last_updated": "2017-10-25T18:33:51.434706",
"country": "Italia",
"records": [
{
"price": "€ 139",
"max_occupancy": 2,
"type": "Type 1",
"availability": 10,
"size": "26 m²"
},
{
"price": "€ 125",
"max_occupancy": 2,
"type": "Type 1 - (Tag)",
"availability": 10,
"size": "35 m²"
},
{
"price": "€ 120",
"max_occupancy": 1,
"type": "Type 2",
"availability": 10,
"size": "47 m²"
},
{
"price": "€ 108",
"max_occupancy": 1,
"type": "Type 2 (Tag)",
"availability": 10,
"size": "31 m²"
}
]
}
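Since re-indexing means rewriting every document into this shape, a conversion step could look like the following sketch. It assumes all five arrays always have the same length, as in the question; the function name to_nested is made up for this example.

```python
ARRAY_FIELDS = ("price", "max_occupancy", "type", "availability", "size")


def to_nested(source):
    """Turn the five parallel arrays into a list of record objects."""
    columns = [source[field] for field in ARRAY_FIELDS]
    # zip(*columns) groups the values that share the same index position.
    records = [dict(zip(ARRAY_FIELDS, values)) for values in zip(*columns)]
    doc = {key: value for key, value in source.items() if key not in ARRAY_FIELDS}
    doc["records"] = records
    return doc


old_source = {
    "last_updated": "2017-10-25T18:33:51.434706",
    "country": "Italia",
    "price": ["€ 139", "€ 125", "€ 120", "€ 108"],
    "max_occupancy": [2, 2, 1, 1],
    "type": ["Type 1", "Type 1 - (Tag)", "Type 2", "Type 2 (Tag)"],
    "availability": [10, 10, 10, 10],
    "size": ["26 m²", "35 m²", "47 m²", "31 m²"],
}
new_doc = to_nested(old_source)
```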
Now it is easier to query for any specific condition with the Nested Query and Inner Hits. For example:
{
"_source": [
"last_updated",
"country"
],
"query": {
"bool": {
"must": [
{
"term": {
"country": "Italia"
}
},
{
"nested": {
"path": "records",
"query": {
"bool": {
"must": [
{
"range": {
"records.max_occupancy": {
"gte": 2
}
}
}
]
}
},
"inner_hits": {
"sort": {
"records.price": "asc"
},
"size": 1
}
}
}
]
}
}
}
Conditions are: Italia AND max_occupancy >= 2.
Inner hits: sort by price in ascending order and return the first result.
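One caveat worth checking: records.price is stored as a string, so an ascending sort compares text rather than numbers (for example "€ 99" would sort after "€ 108"). The sketch below shows the same selection done on the client with the price parsed into a number. The function names and the parsing rule are assumptions made for this example.

```python
import re


def price_value(price):
    """Extract the numeric part of a price string such as '€ 125'."""
    match = re.search(r"\d+(?:[.,]\d+)?", price)
    if match is None:
        raise ValueError(f"no number found in {price!r}")
    return float(match.group().replace(",", "."))


def cheapest_record(doc, min_occupancy=2):
    """Return the cheapest record with max_occupancy >= min_occupancy."""
    candidates = [r for r in doc["records"] if r["max_occupancy"] >= min_occupancy]
    if not candidates:
        return None
    best = min(candidates, key=lambda record: price_value(record["price"]))
    # Merge the general fields with the selected record.
    return {"last_updated": doc["last_updated"], "country": doc["country"], **best}


doc = {
    "last_updated": "2017-10-25T18:33:51.434706",
    "country": "Italia",
    "records": [
        {"price": "€ 139", "max_occupancy": 2, "type": "Type 1",
         "availability": 10, "size": "26 m²"},
        {"price": "€ 125", "max_occupancy": 2, "type": "Type 1 - (Tag)",
         "availability": 10, "size": "35 m²"},
        {"price": "€ 120", "max_occupancy": 1, "type": "Type 2",
         "availability": 10, "size": "47 m²"},
        {"price": "€ 108", "max_occupancy": 1, "type": "Type 2 (Tag)",
         "availability": 10, "size": "31 m²"},
    ],
}
result = cheapest_record(doc)
```

Storing the price as a numeric field at index time would let Elasticsearch do this sort numerically and make the client-side step unnecessary.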
Hope you'll find it useful

How to get the number of hits of several matching fields in one record?

I have records similar to
{
"who": "John",
"hobby": [
{"name": "gardening",
"skills": 2
},
{"name": "sleeping",
"skills": 3
},
{"name": "darts",
"skills": 2
}
]
}
,
{
"who": "Mary",
"hobby": [
{"name": "gardening",
"skills": 2
},
{"name": "volleyball",
"skills": 3
},
{"name": "kung-fu",
"skills": 2
}
]
}
I am looking at building a query which would answer the question: "how many hobbies with skills=2 do we have?"
The answer for the example above would be 3 ("gardening" is common to both, and each have another unique one).
Every "query" or "query"+"aggs" I tried returns in ['hits']['hits'] or ['aggregations']['sources']['buckets'] the number of matching documents, that is two in the case above (one for "John" and one for "Mary", each of them satisfying the query).
Is there a way to build a query so that it returns the total number of fields (in the example above: the elements of the list "hobby") which matched that query? (fields, not documents)
Note: If my documents were flat:
{"who": "John", "name": "gardening", "skills": 2},
{"who": "John", "name": "sleeping", "skills": 3},
(...)
{"who": "Mary", "name": "kung-fu", "skills": 2}
then a simple "query" to match "skills": 2 + an aggregation on "name" would have done the work
Yes, you can achieve this with the nested type and using inner_hits and/or nested aggregations.
So here is the mapping you should use:
curl -XPUT localhost:9200/hobbies -d '{
"mappings": {
"hob": {
"properties": {
"who": {
"type": "string"
},
"hobby": {
"type": "nested", <--- the hobby list is of type nested
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"skills": {
"type": "integer"
}
}
}
}
}
}
}'
Then we can insert your two sample documents using the _bulk endpoint like this:
curl -XPOST localhost:9200/hobbies/hob/_bulk -d '
{"index":{}}
{"who":"John", "hobby":[{"name": "gardening","skills": 2},{"name": "sleeping","skills": 3},{"name": "darts","skills": 2}]}
{"index":{}}
{"who":"Mary", "hobby":[{"name": "gardening","skills": 2},{"name": "volley-ball","skills": 3},{"name": "kung-fu","skills": 2}]}
'
And finally, we can query your index for how many hobbies have skills: 2 like this:
curl -XPOST localhost:9200/hobbies/hob/_search -d '{
"_source": false,
"query": {
"nested": {
"path": "hobby",
"query": {
"term": {
"hobby.skills": 2
}
},
"inner_hits": {} <---- this will return only the matching nested fields with skills=2
}
},
"aggs": {
"hobbies": {
"nested": {
"path": "hobby"
},
"aggs": {
"skills": {
"filter": {
"term": {
"hobby.skills": 2
}
},
"aggs": {
"by_field": { <--- this will return a breakdown of the fields with skills=2
"terms": {
"field": "hobby.name"
}
}
}
}
}
}
}
}'
This query will return:
In the hits part, the four nested hobby entries that have skills: 2
In the aggs part, a breakdown of the 3 distinct hobby names that have skills: 2
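The expected numbers can be verified locally with a few lines of Python over the two sample documents. This is just a cross-check of what the nested aggregation should report, not a replacement for it.

```python
from collections import Counter

people = [
    {"who": "John", "hobby": [{"name": "gardening", "skills": 2},
                              {"name": "sleeping", "skills": 3},
                              {"name": "darts", "skills": 2}]},
    {"who": "Mary", "hobby": [{"name": "gardening", "skills": 2},
                              {"name": "volleyball", "skills": 3},
                              {"name": "kung-fu", "skills": 2}]},
]

# Every nested hobby entry with skills == 2, across all documents.
matching_names = [
    hobby["name"]
    for person in people
    for hobby in person["hobby"]
    if hobby["skills"] == 2
]

total_matches = len(matching_names)   # nested entries, not documents
by_name = Counter(matching_names)     # breakdown per hobby name
```

The total counts nested entries, so "gardening" contributes twice, while the breakdown has one bucket per distinct name.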

Nested filtering in elasticsearch with more than one term of the same nested type

I'm new to elasticsearch, so maybe my approach is plain wrong, but I want to make an index of recipes and allow the user to filter it down with the aggregated ingredients that are still found in the subset.
Maybe I'm using the wrong language to explain so maybe this example will clarify. I would like to search for recipes with the term salt; which results in three recipes:
1. with ingredients: salt, flour, water
2. with ingredients: salt, pepper, egg
3. with ingredients: water, flour, egg, salt
The aggregate on the resulting ingredients returns salt, flour, water, pepper, egg. When I filter with flour I only want recipes 1 and 3 to appear in the search results (and the aggregate on ingredients should only return salt, flour, water and egg). When I add another filter, egg, I want only recipe 3 to appear (and the aggregate should only return water, flour, egg, salt).
I can't make the latter work: one filter next to the default query does narrow down the results as desired, but when I add the other term (egg) to the terms filter, the results start to include recipe 2 as well, as if it were an OR filter. Adding the "and" execution mode to the filter, however, gives NO results... what am I doing wrong?
My mapping:
{
"recipe": {
"properties": {
"title": {
"analyzer": "dutch",
"type": "string"
},
"ingredients": {
"type": "nested",
"properties": {
"name": {
"type": "string",
"analyzer": "dutch",
"include_in_parent": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
}
My query:
{
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"match": {
"_all": "salt"
}
}
]
}
},
"filter": {
"nested": {
"path": "ingredients",
"filter": {
"terms": {
"ingredients.name": [
"flour",
"egg"
],
"execution": "and"
}
}
}
}
}
},
"size": 50,
"aggregations": {
"ingredients": {
"nested": {
"path": "ingredients"
},
"aggregations": {
"count": {
"terms": {
"field": "ingredients.name.raw"
}
}
}
}
}
}
Why are you using a nested mapping here? Its main purpose is to keep relations between the sub-object attributes, but your ingredients field has just one attribute and can be modeled simply as a string field.
So, if you update your mapping like this :
POST recipes
{
"mappings": {
"recipe": {
"properties": {
"title": {
"type": "string"
},
"ingredients": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
You can still index your recipes as :
{
"title":"recipe b",
"ingredients":["salt","pepper","egg"]
}
And this query gives you the result you expect:
POST recipes/recipe/_search
{
"query": {
"filtered": {
"query": {
"match": {
"_all": "salt"
}
},
"filter": {
"terms": {
"ingredients": [
"flour",
"egg"
],
"execution": "and"
}
}
}
},
"size": 50,
"aggregations": {
"ingredients": {
"terms": {
"field": "ingredients"
}
}
}
}
which is :
{
...
"hits": {
"total": 1,
"max_score": 0.22295055,
"hits": [
{
"_index": "recipes",
"_type": "recipe",
"_id": "PP195TTsSOy-5OweArNsvA",
"_score": 0.22295055,
"_source": {
"title": "recipe c",
"ingredients": [
"salt",
"flour",
"egg",
"water"
]
}
}
]
},
"aggregations": {
"ingredients": {
"buckets": [
{
"key": "egg",
"doc_count": 1
},
{
"key": "flour",
"doc_count": 1
},
{
"key": "salt",
"doc_count": 1
},
{
"key": "water",
"doc_count": 1
}
]
}
}
}
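The intended AND semantics can be sketched in plain Python to confirm which recipes and buckets to expect. The recipe titles follow the "recipe b" and "recipe c" naming used above, with "recipe a" assumed for the first one; the function names are made up.

```python
from collections import Counter

recipes = [
    {"title": "recipe a", "ingredients": ["salt", "flour", "water"]},
    {"title": "recipe b", "ingredients": ["salt", "pepper", "egg"]},
    {"title": "recipe c", "ingredients": ["salt", "flour", "egg", "water"]},
]


def filter_all(items, required):
    """Keep recipes that contain every required ingredient (AND)."""
    required = set(required)
    return [item for item in items if required <= set(item["ingredients"])]


def ingredient_counts(items):
    """Count in how many of the given recipes each ingredient appears."""
    return Counter(name for item in items for name in item["ingredients"])


with_flour = filter_all(recipes, ["flour"])
with_flour_and_egg = filter_all(recipes, ["flour", "egg"])
buckets = ingredient_counts(with_flour_and_egg)
```

With both terms required only recipe c remains, and each of its four ingredients gets a count of 1, which matches the response shown above.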
Hope this helps.
