How to get the best matching document in Elasticsearch? - elasticsearch

I have an index where I store all the places used in my documents. I want to use this index to see if the user mentioned one of the places in the text query I receive.
Unfortunately, I have two documents whose name is similar enough to trick Elasticsearch scoring: Stockholm and Stockholm-Arlanda.
My test phrase is intyg stockholm and this is the query I use to get the best matching document.
{
"size": 1,
"query": {
"bool": {
"should": [
{
"match": {
"name": "intyig stockholm"
}
}
],
"must": [
{
"term": {
"type": {
"value": "4"
}
}
},
{
"terms": {
"name": [
"intyg",
"stockholm"
]
}
},
{
"exists": {
"field": "data.coordinates"
}
}
]
}
}
}
As you can see, I use a terms query to find the interesting documents and I use a match query in the should part of the root bool query to use scoring to get the document I want (Stockholm) on top.
This code worked locally (where I run ES in a container) but it broke when I started testing on a cluster hosted in AWS (where I have the exact same dataset). I found this explaining what happens and adding the search type argument actually fixes the issue.
Since the workaround is best not used on production, I'm looking for ways to have the expected result.
Here are the two documents:
// Stockholm
{
"type" : 4,
"name" : "Stockholm",
"id" : "42",
"searchableNames" : [
"Stockholm"
],
"uniqueId" : "Place:42",
"data" : {
"coordinates" : "59.32932349999999,18.0685808"
}
}
// Stockholm-Arlanda
{
"type" : 4,
"name" : "Stockholm-Arlanda",
"id" : "1832",
"searchableNames" : [
"Stockholm-Arlanda"
],
"uniqueId" : "Place:1832",
"data" : {
"coordinates" : "59.6497622,17.9237807"
}
}

Related

Custom ordering on elastic search

I'm executing a simple query which returns items matched by companyId.
In addition to only showing clients matching a specific company I also want records matching a certain location to appear at the top.So if somehow I pass through pseudo sort:"location=Johannesburg" it would return the data below and items which match the specific location would appear on top, followed by items with other locations.
Data:
{
"clientId" : 1,
"clientName" : "Name1",
"companyId" : 8,
"location" : "Cape Town"
},
{
"clientId" : 2,
"clientName" : "Name2",
"companyId" : 8,
"location" : "Johannesburg"
}
Query:
{
"query": {
"match": {
"companyId": "8"
}
},
"size": 10,
"_source": {
"includes": [
"firstName",
"companyId",
"location"
]
}
}
Is something like this possible in elastic and if so what is the name of this concept?(I'm not sure what to even Google for to solve this problem)
It can be done in different ways.
Simplest (if go only with text matching) is use bool query with should statement.
The bool query takes a more-matches-is-better approach, so the score from each matching must or should clause will be added together to provide the final _score for each document. Doc
Example:
{"query":
"bool": {
"must": [
"match": {
"companyId": "8"
}
],
"should": [
"match": {
"location": "Johannesburg"
}
]
}
}
}
More complex solution is to store GEO points in location, and use Distance feature query as example.

Using two separate relational indexes in elasticsearch is sense or not?

I have two relational mysql tables and i need to store these data to elasticsearch.
I stored like this and i wanted to ask you if there is a best way or not :
POST categories/_doc
{
"id" : 1,
"name" : "Phones"
}
POST categories/_doc
{
"id" : 2,
"name" : "TV"
}
PUT products
{
"mappings": {
"properties": {
"attributes": {
"type": "nested"
}
}
}
}
POST products/_doc
{
"id" : 3
"category_id" : 1
"name" : "IPhone 5S",
"attributes" : [
{
"color" : "red",
"stock" : 4
},
{
"color" : "blue",
"stock" : 2
}
]
}
POST products/_doc
{
"id" : 5
"category_id" : 2
"name" : "Samsung TV",
"attributes" : [
{
"color" : "red",
"stock" : 2
},
{
"color" : "yellow",
"stock" : 4
}
]
}
And i use two queries for searching :
I firstly search on categories index after that i send category id values to products index
GET products/_search
{
"query": {
"bool": {
"filter": [
{
"nested": {
"path": "attributes",
"query": {
"terms": {
"attributes.color": [
"red",
"blue"
]
}
},
"inner_hits": {}
}
},
{
"term": {
"category_id": 2
}
}
]
}
}
}
Can you please share your comments about this topic ?
Thank you in advance
It is not a good practice to do like this, since you will have a really difficult time doing sorting and pagination on your application. In Elasticsearch is really important to flat the data for performance reasons and store it in one index, if you have really separated content you can have them in multiple indices. Also keep these points in mind:
Elasticsearch is not a relational database. You will not be able to Join indices as you used to join Tables
Denormalization is not natural but is a key for efficiency in an Elasticsearch application.
Thinking of your data mapping at the early beginning will allow your
app to fly for years.

Elasticsearch query on a nested field with condition

Elasticsearch v7.0
Hello and good day!
I'm trying to create a query that will have a condition: if a nested field has only 1 element, get that first element, if a nested field has 2 more or elements, get a matching nested field condition
Scenario:
I have an index named socialmedia and has a nested field named cms which places a sentiment for that document
An example document of the cms field looks like this
"_id" : 1,
"cms" : [
{
"cli_id" : 0,
"cmx_sentiment" : "Negative"
}
]
This cms field contains "cli_id" : 0 by default for its 1st element (this means it is for all the clients/users to see) but sooner or later, it goes like this:
"_id": 1,
"cms" : [
{
"cli_id" : 0,
"cmx_sentiment" : "Negative"
},
{
"cli_id" : 1,
"cmx_sentiment" : "Positive"
},
{
"cli_id" : 2,
"cmx_sentiment" : "Neutral"
},
]
The 2nd and 3rd element shows that the clients with cli_id equals to 1 and 2 has made a sentiment for that document.
Now, I want to formulate a query that if the client who logged in has no sentiment yet for a specific document, it fetches the cmx_sentiment that has the "cli_id" : 0
BUT , if the client who has logged in has a sentiment for the fetched documents according to his filters, the query will fetch the cmx_sentiment that has the matching cli_id of the logged in client
for example:
the client who has a cli_id of 2, will get the cmx_sentiment of **Neutral** according to the given document above
the client who has a cli_id of 5, will get the cmx_sentiment of **Negative** because he hasn't given a sentiment to the document
PSEUDO CODE :
If a document has a sentiment indicated by the client, get the cmx_sentiment of the cli_id == to the client's ID
if a document is fresh or the client HAS NOT labeled yet a sentiment on that document, get the element's cmx_sentiment that has cli_id == 0
I'm in need of a query to condition for the pseudo code above
Here's my sample query:
"aggs" => [
"CMS" => [
"nested" => [
"path" => "cms",
],
"aggs" => [
"FILTER" => [
"filter" => [
"bool" => [
"should" => [
[
"match" => [
"cms.cli_id" => 0
]
],
[
"bool" => [
"must" => [
[
// I'm planing to create a bool method here to test if cli_id is equalis to the logged-in client's ID
]
]
]
]
]
]
],
"aggs"=> [
"TONALITY"=> [
"terms"=> [
"field" => "cms.cmx_sentiment"
],
]
]
]
]
]
]
Is my query correct?
The problem with the query I have provided, is that it SUMS all the elements, instead of picking one only
The query above provides this scenario:
The client with cli_id 2 logs in
Both the Neutral and Negative cmx_sentiment are being retrieved, instead of the Neutral alone
After the discussion with OP I'm rewriting this answer.
To get the desired result you will have to consider the following to build the query and aggregation:
Query:
This will contain any filter applied by logged in user. For the example purpose I'm using match_all since every document has atleast one nested doc against cms field i.e. for cli_id: 0
Aggregation:
Here we have to divide the aggregations into two:
default_only
sentiment_only
default_only
In this aggregation we find count for those document which don't have nested document for cli_id: <logged in client id>. i.e. only those docs which have nested doc for cli_id: 0.
To do this we follow the steps below:
default_only Use filter aggregation to get document which does not have nested document for cli_id: <logged in client id> i.e. using must_not => cli_id: <logged in client id>
default_nested : Add sub aggregation for nested docs since we need to get the docs against sentiment which is field of nested document.
sentiment_for_cli_id : Add sub aggregation to default_nested aggregation in order to get sentiment only for default client i.e. for cli_id: 0.
default : Add this terms sub aggregation to sentiment_for_cli_id aggregation to get counts against the sentiment. Note that this count is of nested docs and since you always have only one nested doc per cli_id therefore this count seems to be the count of docs but it is not.
the_doc_count: Add this reverse_nested aggregation to get out of nested doc aggs and the count of parent docs. We add this as the sub aggregation of default aggregation.
sentiment_only
This aggregation give count against each sentiment where cli_id: <logged in client id> is present. For this we follow the same approach as we followed for default_only aggregation. But with some tweaks as below:
sentiment_only : must => cli_id: <logged in client id>
sentiment_nested : same reason as above
sentiment_for_cli_id: same but instead of default we filter for cli_id: <logged in client id>
sentiment: same as default
the_doc_count: same as above
Example:
PUT socialmedia/_bulk
{"index":{"_id": 1}}
{"cms":[{"cli_id":0,"cmx_sentiment":"Positive"}]}
{"index":{"_id": 2}}
{"cms":[{"cli_id":0,"cmx_sentiment":"Positive"},{"cli_id":2,"cmx_sentiment":"Neutral"}]}
{"index":{"_id": 3}}
{"cms":[{"cli_id":0,"cmx_sentiment":"Positive"},{"cli_id":2,"cmx_sentiment":"Negative"}]}
{"index":{"_id": 4}}
{"cms":[{"cli_id":0,"cmx_sentiment":"Positive"},{"cli_id":2,"cmx_sentiment":"Neutral"}]}
Query:
GET socialmedia/_search
{
"query": {
"match_all": {}
},
"aggs": {
"default_only": {
"filter": {
"bool": {
"must_not": [
{
"nested": {
"path": "cms",
"query": {
"term": {
"cms.cli_id": 2
}
}
}
}
]
}
},
"aggs": {
"default_nested": {
"nested": {
"path": "cms"
},
"aggs": {
"sentiment_for_cli_id": {
"filter": {
"term": {
"cms.cli_id": 0
}
},
"aggs": {
"default": {
"terms": {
"field": "cms.cmx_sentiment"
},
"aggs": {
"the_doc_count": {
"reverse_nested": {}
}
}
}
}
}
}
}
}
},
"sentiment_only": {
"filter": {
"bool": {
"must": [
{
"nested": {
"path": "cms",
"query": {
"term": {
"cms.cli_id": 2
}
}
}
}
]
}
},
"aggs": {
"sentiment_nested": {
"nested": {
"path": "cms"
},
"aggs": {
"sentiment_for_cli_id": {
"filter": {
"term": {
"cms.cli_id": 2
}
},
"aggs": {
"sentiment": {
"terms": {
"field": "cms.cmx_sentiment"
},
"aggs": {
"the_doc_count": {
"reverse_nested": {}
}
}
}
}
}
}
}
}
}
}
}
Agg Output:
"aggregations" : {
"default_only" : {
"doc_count" : 1,
"default_nested" : {
"doc_count" : 1,
"sentiment_for_cli_id" : {
"doc_count" : 1,
"default" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Positive",
"doc_count" : 1,
"the_doc_count" : {
"doc_count" : 1
}
}
]
}
}
}
},
"sentiment_only" : {
"doc_count" : 3,
"sentiment_nested" : {
"doc_count" : 6,
"sentiment_for_cli_id" : {
"doc_count" : 3,
"sentiment" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Neutral",
"doc_count" : 2,
"the_doc_count" : {
"doc_count" : 2
}
},
{
"key" : "Negative",
"doc_count" : 1,
"the_doc_count" : {
"doc_count" : 1
}
}
]
}
}
}
}
}

Return distinct values in Elasticsearch

I am trying to solve an issue where I have to get distinct result in the search.
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "GEORGE",
"favorite_cars" : [ "honda","Hyundae" ]
}
When I perform a term query on favourite cars "ferrari". I get two results whose name is ABC. I simply want that the result returned should be one in this case. So my requirement will be if I can apply a distinct on name field to receive one 1 result.
Thanks
One way to achieve what you want is to use a terms aggregation on the name field and then a top_hits sub-aggregation with size 1, like this:
{
"size": 0,
"query": {
"term": {
"favorite_cars": "ferrari"
}
},
"aggs": {
"names": {
"terms": {
"field": "name"
},
"aggs": {
"single_result": {
"top_hits": {
"size": 1
}
}
}
}
}
}
That way, you'll get a single term ABC and then nested into it a single matching document

ElasticSearch filter by array item

I have the following record in ES:
"authInput" : {
"uID" : "foo",
"userName" : "asdfasdfasdfasdf",
"userType" : "External",
"clientType" : "Unknown",
"authType" : "Redemption_regular",
"uIDExtensionFields" :
[
{
"key" : "IsAccountCreation",
"value" : "true"
}
],
"externalReferences" : []
}
"uIDExtensionFields" is an array of key/value pairs. I want to query ES to find all records where:
"uIDExtensionFields.key" = "IsAccountCreation"
AND "uIDExtensionFields.value" = "true"
This is the filter that I think I should be using but it never returns any data.
GET devdev/authEvent/_search
{
"size": 10,
"filter": {
"and": {
"filters": [
{
"term": {
"authInput.uIDExtensionFields.key" : "IsAccountCreation"
}
},
{
"term": {
"authInput.uIDExtensionFields.value": "true"
}
}
]
}
}
}
Any help you guys could give me would be much appreciated.
Cheers!
UPDATE: WITH THE HELP OF THE RESPONSES BELOW HERE IS HOW I SOLVED MY PROBLEM:
lowercase the value that I was searching for. (changed "IsAccoutCreation" to "isaccountcreation")
Updated the mapping so that "uIDExtensionFields" is a nested type
Updated my filter to the following:
_
GET devhilden/authEvent/_search
{
"size": 10,
"filter": {
"nested": {
"path": "authInput.uIDExtensionFields",
"query": {
"bool": {
"must": [
{
"term": {
"authInput.uIDExtensionFields.key": "isaccountcreation"
}
},
{
"term": {
"authInput.uIDExtensionFields.value": "true"
}
}
]
}
}
}
}
}
There are a few things probably going wrong here.
First, as mconlin points out, you probably have a mapping with the standard analyzer for your key field. It'll lowercase the key. You probably want to specify "index": "not_analyzed" for the field.
Secondly, you'll have to use nested mappings for this document structure and specify the key and the value in a nested filter. That's because otherwise, you'll get a match for the following document:
"uIDExtensionFields" : [
{
"key" : "IsAccountCreation",
"value" : "false"
},
{
"key" : "SomeOtherField",
"value" : "true"
}
]
Thirdly, you'll want to be using the bool-filter's must and not and to ensure proper cachability.
Lastly, you'll want to put your filter in the filtered-query. The top-level filter is for when you want hits to be filtered, but facets/aggregations to not be. That's why it's renamed to post_filter in 1.0.
Here's a few resources you'll want to check out:
Troubleshooting Elasticsearch searches, for Beginners covers the first two issues.
Managing Relations in ElasticSearch covers nested docs (and parent/child)
all about elasticsearch filter bitsets covers and vs. bool.

Resources