Using two separate relational indexes in elasticsearch is sense or not? - elasticsearch

I have two relational mysql tables and i need to store these data to elasticsearch.
I stored like this and i wanted to ask you if there is a best way or not :
POST categories/_doc
{
"id" : 1,
"name" : "Phones"
}
POST categories/_doc
{
"id" : 2,
"name" : "TV"
}
PUT products
{
"mappings": {
"properties": {
"attributes": {
"type": "nested"
}
}
}
}
POST products/_doc
{
"id" : 3
"category_id" : 1
"name" : "IPhone 5S",
"attributes" : [
{
"color" : "red",
"stock" : 4
},
{
"color" : "blue",
"stock" : 2
}
]
}
POST products/_doc
{
"id" : 5
"category_id" : 2
"name" : "Samsung TV",
"attributes" : [
{
"color" : "red",
"stock" : 2
},
{
"color" : "yellow",
"stock" : 4
}
]
}
And i use two queries for searching :
I firstly search on categories index after that i send category id values to products index
GET products/_search
{
"query": {
"bool": {
"filter": [
{
"nested": {
"path": "attributes",
"query": {
"terms": {
"attributes.color": [
"red",
"blue"
]
}
},
"inner_hits": {}
}
},
{
"term": {
"category_id": 2
}
}
]
}
}
}
Can you please share your comments about this topic ?
Thank you in advance

It is not a good practice to do like this, since you will have a really difficult time doing sorting and pagination on your application. In Elasticsearch is really important to flat the data for performance reasons and store it in one index, if you have really separated content you can have them in multiple indices. Also keep these points in mind:
Elasticsearch is not a relational database. You will not be able to Join indices as you used to join Tables
Denormalization is not natural but is a key for efficiency in an Elasticsearch application.
Thinking of your data mapping at the early beginning will allow your
app to fly for years.

Related

How to get the best matching document in Elasticsearch?

I have an index where I store all the places used in my documents. I want to use this index to see if the user mentioned one of the places in the text query I receive.
Unfortunately, I have two documents whose name is similar enough to trick Elasticsearch scoring: Stockholm and Stockholm-Arlanda.
My test phrase is intyg stockholm and this is the query I use to get the best matching document.
{
"size": 1,
"query": {
"bool": {
"should": [
{
"match": {
"name": "intyig stockholm"
}
}
],
"must": [
{
"term": {
"type": {
"value": "4"
}
}
},
{
"terms": {
"name": [
"intyg",
"stockholm"
]
}
},
{
"exists": {
"field": "data.coordinates"
}
}
]
}
}
}
As you can see, I use a terms query to find the interesting documents and I use a match query in the should part of the root bool query to use scoring to get the document I want (Stockholm) on top.
This code worked locally (where I run ES in a container) but it broke when I started testing on a cluster hosted in AWS (where I have the exact same dataset). I found this explaining what happens and adding the search type argument actually fixes the issue.
Since the workaround is best not used on production, I'm looking for ways to have the expected result.
Here are the two documents:
// Stockholm
{
"type" : 4,
"name" : "Stockholm",
"id" : "42",
"searchableNames" : [
"Stockholm"
],
"uniqueId" : "Place:42",
"data" : {
"coordinates" : "59.32932349999999,18.0685808"
}
}
// Stockholm-Arlanda
{
"type" : 4,
"name" : "Stockholm-Arlanda",
"id" : "1832",
"searchableNames" : [
"Stockholm-Arlanda"
],
"uniqueId" : "Place:1832",
"data" : {
"coordinates" : "59.6497622,17.9237807"
}
}

elasticearch aggregation by array size

I need a stats on elasticsearch. I can't make the request.
I would like to know the number of people per appointment.
appointment index mapping
{
"id" : "383577",
"persons" : [
{
"id" : "1",
},
{
"id" : "2",
}
]
}
what i would like
"buckets" : [
{
"key" : "1", <--- appointment of 1 person
"doc_count" : 1241891
},
{
"key" : "2", <--- appointment of 2 persons
"doc_count" : 10137
},
{
"key" : "3", <--- appointment of 3 persons
"doc_count" : 8064
}
]
Thank you
The easiest way to do this is to create another integer field containing the length of the persons array and aggregating on that field.
{
"id" : "383577",
"personsCount": 2, <---- add this field
"persons" : [
{
"id" : "1",
},
{
"id" : "2",
}
]
}
The non-optimal way of achieving what you expect is to use a script that will return the length of the persons array dynamically, but be aware that this is sub-optimal and can potentially harm your cluster depending on the volume of data you have:
GET /_search
{
"aggs": {
"persons": {
"terms": {
"script": "doc['persons.id'].size()"
}
}
}
}
If you want to update all your documents to create that field you can do it like this:
POST index/_update_by_query
{
"script": {
"source": "ctx._source.personsCount = ctx._source.persons.length"
}
}
However, you'll also need to modify the logic of your indexing application to create that new field.

Elasticsearch query on a nested field with condition

Elasticsearch v7.0
Hello and good day!
I'm trying to create a query that will have a condition: if a nested field has only 1 element, get that first element, if a nested field has 2 more or elements, get a matching nested field condition
Scenario:
I have an index named socialmedia and has a nested field named cms which places a sentiment for that document
An example document of the cms field looks like this
"_id" : 1,
"cms" : [
{
"cli_id" : 0,
"cmx_sentiment" : "Negative"
}
]
This cms field contains "cli_id" : 0 by default for its 1st element (this means it is for all the clients/users to see) but sooner or later, it goes like this:
"_id": 1,
"cms" : [
{
"cli_id" : 0,
"cmx_sentiment" : "Negative"
},
{
"cli_id" : 1,
"cmx_sentiment" : "Positive"
},
{
"cli_id" : 2,
"cmx_sentiment" : "Neutral"
},
]
The 2nd and 3rd element shows that the clients with cli_id equals to 1 and 2 has made a sentiment for that document.
Now, I want to formulate a query that if the client who logged in has no sentiment yet for a specific document, it fetches the cmx_sentiment that has the "cli_id" : 0
BUT , if the client who has logged in has a sentiment for the fetched documents according to his filters, the query will fetch the cmx_sentiment that has the matching cli_id of the logged in client
for example:
the client who has a cli_id of 2, will get the cmx_sentiment of **Neutral** according to the given document above
the client who has a cli_id of 5, will get the cmx_sentiment of **Negative** because he hasn't given a sentiment to the document
PSEUDO CODE :
If a document has a sentiment indicated by the client, get the cmx_sentiment of the cli_id == to the client's ID
if a document is fresh or the client HAS NOT labeled yet a sentiment on that document, get the element's cmx_sentiment that has cli_id == 0
I'm in need of a query to condition for the pseudo code above
Here's my sample query:
"aggs" => [
"CMS" => [
"nested" => [
"path" => "cms",
],
"aggs" => [
"FILTER" => [
"filter" => [
"bool" => [
"should" => [
[
"match" => [
"cms.cli_id" => 0
]
],
[
"bool" => [
"must" => [
[
// I'm planing to create a bool method here to test if cli_id is equalis to the logged-in client's ID
]
]
]
]
]
]
],
"aggs"=> [
"TONALITY"=> [
"terms"=> [
"field" => "cms.cmx_sentiment"
],
]
]
]
]
]
]
Is my query correct?
The problem with the query I have provided, is that it SUMS all the elements, instead of picking one only
The query above provides this scenario:
The client with cli_id 2 logs in
Both the Neutral and Negative cmx_sentiment are being retrieved, instead of the Neutral alone
After the discussion with OP I'm rewriting this answer.
To get the desired result you will have to consider the following to build the query and aggregation:
Query:
This will contain any filter applied by logged in user. For the example purpose I'm using match_all since every document has atleast one nested doc against cms field i.e. for cli_id: 0
Aggregation:
Here we have to divide the aggregations into two:
default_only
sentiment_only
default_only
In this aggregation we find count for those document which don't have nested document for cli_id: <logged in client id>. i.e. only those docs which have nested doc for cli_id: 0.
To do this we follow the steps below:
default_only Use filter aggregation to get document which does not have nested document for cli_id: <logged in client id> i.e. using must_not => cli_id: <logged in client id>
default_nested : Add sub aggregation for nested docs since we need to get the docs against sentiment which is field of nested document.
sentiment_for_cli_id : Add sub aggregation to default_nested aggregation in order to get sentiment only for default client i.e. for cli_id: 0.
default : Add this terms sub aggregation to sentiment_for_cli_id aggregation to get counts against the sentiment. Note that this count is of nested docs and since you always have only one nested doc per cli_id therefore this count seems to be the count of docs but it is not.
the_doc_count: Add this reverse_nested aggregation to get out of nested doc aggs and the count of parent docs. We add this as the sub aggregation of default aggregation.
sentiment_only
This aggregation give count against each sentiment where cli_id: <logged in client id> is present. For this we follow the same approach as we followed for default_only aggregation. But with some tweaks as below:
sentiment_only : must => cli_id: <logged in client id>
sentiment_nested : same reason as above
sentiment_for_cli_id: same but instead of default we filter for cli_id: <logged in client id>
sentiment: same as default
the_doc_count: same as above
Example:
PUT socialmedia/_bulk
{"index":{"_id": 1}}
{"cms":[{"cli_id":0,"cmx_sentiment":"Positive"}]}
{"index":{"_id": 2}}
{"cms":[{"cli_id":0,"cmx_sentiment":"Positive"},{"cli_id":2,"cmx_sentiment":"Neutral"}]}
{"index":{"_id": 3}}
{"cms":[{"cli_id":0,"cmx_sentiment":"Positive"},{"cli_id":2,"cmx_sentiment":"Negative"}]}
{"index":{"_id": 4}}
{"cms":[{"cli_id":0,"cmx_sentiment":"Positive"},{"cli_id":2,"cmx_sentiment":"Neutral"}]}
Query:
GET socialmedia/_search
{
"query": {
"match_all": {}
},
"aggs": {
"default_only": {
"filter": {
"bool": {
"must_not": [
{
"nested": {
"path": "cms",
"query": {
"term": {
"cms.cli_id": 2
}
}
}
}
]
}
},
"aggs": {
"default_nested": {
"nested": {
"path": "cms"
},
"aggs": {
"sentiment_for_cli_id": {
"filter": {
"term": {
"cms.cli_id": 0
}
},
"aggs": {
"default": {
"terms": {
"field": "cms.cmx_sentiment"
},
"aggs": {
"the_doc_count": {
"reverse_nested": {}
}
}
}
}
}
}
}
}
},
"sentiment_only": {
"filter": {
"bool": {
"must": [
{
"nested": {
"path": "cms",
"query": {
"term": {
"cms.cli_id": 2
}
}
}
}
]
}
},
"aggs": {
"sentiment_nested": {
"nested": {
"path": "cms"
},
"aggs": {
"sentiment_for_cli_id": {
"filter": {
"term": {
"cms.cli_id": 2
}
},
"aggs": {
"sentiment": {
"terms": {
"field": "cms.cmx_sentiment"
},
"aggs": {
"the_doc_count": {
"reverse_nested": {}
}
}
}
}
}
}
}
}
}
}
}
Agg Output:
"aggregations" : {
"default_only" : {
"doc_count" : 1,
"default_nested" : {
"doc_count" : 1,
"sentiment_for_cli_id" : {
"doc_count" : 1,
"default" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Positive",
"doc_count" : 1,
"the_doc_count" : {
"doc_count" : 1
}
}
]
}
}
}
},
"sentiment_only" : {
"doc_count" : 3,
"sentiment_nested" : {
"doc_count" : 6,
"sentiment_for_cli_id" : {
"doc_count" : 3,
"sentiment" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Neutral",
"doc_count" : 2,
"the_doc_count" : {
"doc_count" : 2
}
},
{
"key" : "Negative",
"doc_count" : 1,
"the_doc_count" : {
"doc_count" : 1
}
}
]
}
}
}
}
}

Elastic Search Query for Multi-valued Data

ES Data is indexed like this :
{
"addresses" : [
{
"id" : 69,
"location": "New Delhi"
},
{
"id" : 69,
"location": "Mumbai"
}
],
"goods" : [
{
"id" : 396,
"name" : "abc",
"price" : 12500
},
{
"id" : 167,
"name" : "XYz",
"price" : 12000
},
{
"id" : 168,
"name" : "XYz1",
"price" : 11000
},
{
"id" : 169,
"name" : "XYz2",
"price" : 13000
}
]
}
In my query I want to fetch records which should have at-least one of the address matched and goods price range between 11000 and 13000 and name xyz.
When your data contains arrays of complex objects like a list of addresses or a list of goods, you probably want to have a look at elasticsearch's nested objects to avoid running into problems when your queries result in more items than you would expect.
The issue here is the way how elasticsearch (and in effect lucene) stores the data. As there is no such concept of lists of nested objects directly, the data is flattened and the connection between e.g. XYz and 12000 is lost. So you would also get this document as result when you query for XYz and 12500 as the price of 12500 is also there in the list of values for goods.price. To avoid this, you can use the nested objects feature of elasticsearch which basically extracts all inner objects into a hidden index and allows querying for several fields that occur in one specific object instead of "in any of the objects". For more details, have a look at the docs on nested objects which also explains this pretty good.
In your case a mapping could look like the following. I assume, you only want to query for the addresses.location text without providing the id, so that this list can remain the simple object type instead of also being a nested type. Also, I assume you query for exact matches. If this is not the case, you need to switch from keyword to text and adapt the term query to be some match one...
PUT nesting-sample
{
"mappings": {
"item": {
"properties": {
"addresses": {
"properties": {
"id": {"type": "integer"},
"location": {"type": "keyword"}
}
},
"goods": {
"type": "nested",
"properties": {
"id": {"type": "integer"},
"name": {"type": "keyword"},
"price": {"type": "integer"}
}
}
}
}
}
}
You can then use a bool query on the location and a nested query to match the inner documents of your goods list.
GET nesting-sample/item/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"addresses.location": "New Delhi"
}
},
{
"nested": {
"path": "goods",
"query": {
"bool": {
"must": [
{
"range": {
"goods.price": {
"gte": 12200,
"lt": 12999
}
}
},
{
"term": {
"goods.name": {
"value": "XYz"
}
}
}
]
}
}
}
}
]
}
}
}
This query will not match the document because the price range is not in the same nested object as the exact name of the good. If you change the lower bound to 12000 it will match.
Please check your use case and be aware of the warning on the bottom of the docs regarding the mapping explosion when using nested fields.

ElasticSearch filter by array item

I have the following record in ES:
"authInput" : {
"uID" : "foo",
"userName" : "asdfasdfasdfasdf",
"userType" : "External",
"clientType" : "Unknown",
"authType" : "Redemption_regular",
"uIDExtensionFields" :
[
{
"key" : "IsAccountCreation",
"value" : "true"
}
],
"externalReferences" : []
}
"uIDExtensionFields" is an array of key/value pairs. I want to query ES to find all records where:
"uIDExtensionFields.key" = "IsAccountCreation"
AND "uIDExtensionFields.value" = "true"
This is the filter that I think I should be using but it never returns any data.
GET devdev/authEvent/_search
{
"size": 10,
"filter": {
"and": {
"filters": [
{
"term": {
"authInput.uIDExtensionFields.key" : "IsAccountCreation"
}
},
{
"term": {
"authInput.uIDExtensionFields.value": "true"
}
}
]
}
}
}
Any help you guys could give me would be much appreciated.
Cheers!
UPDATE: WITH THE HELP OF THE RESPONSES BELOW HERE IS HOW I SOLVED MY PROBLEM:
lowercase the value that I was searching for. (changed "IsAccoutCreation" to "isaccountcreation")
Updated the mapping so that "uIDExtensionFields" is a nested type
Updated my filter to the following:
_
GET devhilden/authEvent/_search
{
"size": 10,
"filter": {
"nested": {
"path": "authInput.uIDExtensionFields",
"query": {
"bool": {
"must": [
{
"term": {
"authInput.uIDExtensionFields.key": "isaccountcreation"
}
},
{
"term": {
"authInput.uIDExtensionFields.value": "true"
}
}
]
}
}
}
}
}
There are a few things probably going wrong here.
First, as mconlin points out, you probably have a mapping with the standard analyzer for your key field. It'll lowercase the key. You probably want to specify "index": "not_analyzed" for the field.
Secondly, you'll have to use nested mappings for this document structure and specify the key and the value in a nested filter. That's because otherwise, you'll get a match for the following document:
"uIDExtensionFields" : [
{
"key" : "IsAccountCreation",
"value" : "false"
},
{
"key" : "SomeOtherField",
"value" : "true"
}
]
Thirdly, you'll want to be using the bool-filter's must and not and to ensure proper cachability.
Lastly, you'll want to put your filter in the filtered-query. The top-level filter is for when you want hits to be filtered, but facets/aggregations to not be. That's why it's renamed to post_filter in 1.0.
Here's a few resources you'll want to check out:
Troubleshooting Elasticsearch searches, for Beginners covers the first two issues.
Managing Relations in ElasticSearch covers nested docs (and parent/child)
all about elasticsearch filter bitsets covers and vs. bool.

Resources