Having values as keys VS having them as a nested object array in ElasticSearch

Having values as keys VS having them as a nested object array in ElasticSearch - elasticsearch

Currently , I have a elasticsearch index with a field that has subfields like say A,B,C as below:
"myfield":{
"A":{
"name":"A",
"prop1":{
"sub-prop1":1,
"sub-prop2":2
},
"prop2":{}
},
"B":{
"name":"B",
"prop1":{
"sub-prop1":3,
"sub-prop2":8,
"sub-prop3":4,
"sub-prop4":7,
},
"prop2":{}
},
"C":{}
}
As can be seen, the structure of A and B fields are same, but the sub-props under the prop1 can be dynamic , meaning based on the documents added, the mapping might change but its not an issue as A and B exist as separate keys.However, because of this I am facing another problem, in that keeping on adding new documents, due to dynamic mapping, its possible that such sub-props or sub-fields like A,B,C,D ... and so on keep getting added to the mapping, which in turn might cause the mapping to exceed the index.mapping.total_fields.limit ,so to avoid that I am planning to make "myfield" and "prop1" fields as array of objects instead in the mapping, so that the fields A,B,C... are stored as array elements instead of keep getting added to the mapping as new fields.
The question is - is this a feasible solution and how to search for say, "myfield.A.prop1.sub-prop1" >= 3
the new mapping looks something like:
"myfield":[
{
"name":"A",
"prop1":{
"sub-prop1":1,
"sub-prop2":2
},
"prop2":{}
},
{
"name":"B",
"prop1":{
"sub-prop1":3,
"sub-prop2":8,
"sub-prop3":4,
"sub-prop4":7,
},
"prop2":{}
},
{}
]

Related

Indexing strategy for hierarchical structures on ElasticSearch

Let's say I have hierarchical types such as in example below:
base_type
child_type1
child_type3
child_type2
child_type1 and child_type2 inherit metadata properties from base_type. child_type3 has all properties inherited from both child_type1 and base_type.
To add to the example, here's several objects with their properties:
base_type_object: {
base_type_property: "bto_prop_value_1"
},
child_type1_object: {
base_type_property: "ct1o_prop_value_1",
child_type1_property: "ct1o_prop_value_2"
},
child_type2_object: {
base_type_property: "ct2o_prop_value_1",
child_type2_property: "ct2o_prop_value_2"
},
child_type3_object: {
base_type_property: "ct3o_prop_value_1",
child_type1_property: "ct3o_prop_value_2",
child_type3_property: "ct3o_prop_value_3"
}
When I query for base_type_object, I expect to search base_type_property values in each and every one of the child types as well. Likewise, if I query for child_type1_property, I expect to search through all types that have such property, meaning objects of type child_type1 and child_type3.
I see that mapping types have been removed. What I'm wondering is whether this use case warrants indexing under separate indices.
My current line of thinking using example above would be to create 4 indices: base_type_index, child_type1_index, child_type2_index and child_type3_index. Each index would only have mappings of their own properties, so base_type_index would only have base_type_property, child_type1_index would have child_type1_property etc. Indexing child_type1_object would create an entry on both base_type_index and child_type1_index indices.
This seems convenient because, as far as I can see, it's possible to search multiple indices using GET /my-index-000001,my-index-000002/_search. So I would theoretically just need to list hierarchy of my types in GET request: GET /base_type_index,child_type1_index/_search.
To make it easier to understand, here is how it would be indexed:
base_type_index
base_type_object: {
base_type_property: "bto_prop_value_1"
},
child_type1_object: {
base_type_property: "ct1o_prop_value_1"
},
child_type2_object: {
base_type_property: "ct2o_prop_value_1",
},
child_type3_object: {
base_type_property: "ct3o_prop_value_1",
}
child_type1_index
child_type1_object: {
child_type1_property: "ct1o_prop_value_2"
},
child_type3_object: {
child_type1_property: "ct3o_prop_value_2",
}
I think values for child_type2_index and child_type3_index are apparent, so I won't list them in order to keep the post length at a more reasonable level.
Does this make sense and is there a better way of indexing for my use case?

Elasticsearch nested objects with query_string as first class attributes

I'm trying to index a nested field as a first-class attribute in my document so that I can search them using query_string without dot syntax.
For example, if I have a document like
"data": { "name": "Bob" }
instead of searching for data.name:Bob I would like to be able to search for name:Bob
The root of my issue is that we index a jsonb column that may have varying attributes. In some instances the data property may contain a data.business attribute, etc. I would like users to be able to search on these attributes without needing to "dig" into the object.
The data field does not have to be indexed as a nested type unless necessary; I was indexing it as an object previously.
I have tried to leverage the _all field as suggested in this post.
I have also tried to use include_in_parent:true and set the datatype as nested for my data field as suggested in this post.
I have also looked into the inner_hits feature to no avail.
Here's an example of my mapping for the data attribute.
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"data": {
"type": "object"
}
}
}
}
}
Example document
PUT my_index/_doc/1
{
"data": {
name: "bob",
business: "None of yours"
}
}
And how my query currently looks:
GET my_index/_search
{
"query": {
"query_string": {
"query": "name:bob",
"fields": ["data.*"]
}
}
}
With the current setup I almost get my desired results. I can search on individual properties like data.name:bob and data.business:"None of yours" and get back the correct documents.
However I want to be able to get the exact same results with business:"None of yours" or name:bob.
Thanks in advance for any help!

I figured it out using dynamic templates. For anyone coming across this in the future, here is how I solved the issue:
I used path_match to match the data object (data.*).
Then using copy_to and {name} I dynamically created top-level fields on my parent object.
{
"dynamic_templates":[
{"template_1":
{"mapping":
{"copy_to":"{name}"},
"path_match":"data.*"
}
}
]
}

how to use Elastic Search nested queries by object key instead of object property

Following the Elastic Search example in this article for a nested query, I noticed that it assumes the nested objects are inside an ARRAY and that queries are based on some object PROPERTY:
{
nested_objects: [ <== array
{ name: "x", value: 123 },
{ name: "y", value: 456 } <== "name" property searchable
]
}
But what if I want nested objects to be arranged in key-value structure that gets updated with new objects, and I want to search by the KEY? example:
{
nested_objects: { <== key-value, not array
"x": { value: 123 },
"y": { value: 456 } <== how can I search by "x" and "y" keys?
"..." <=== more arbitrary keys are added now and then
]
}
Thank you!

You can try to do this using the query_string query, like this:
GET my_index/_search
{
"query": {
"query_string": {
"query":"nested_objects.\\*.value:123"
}
}
}
It will try to match the value field of any sub-field of nested_objects.

Ok, so my final solution after some ES insights is as follows:
1. The fact that my object keys "x", "y", ... are arbitrary causes a mess in my index mapping. So generally speaking, it's not a good ES practice to plan this kind of structure... So for the sake of mappings, I resort to the structure described in the "Weighted tags" article:
{ "name":"x", "value":123 },
{ "name":"y", "value":456 },
...
This means that, when it's time to update the value of the sub-object named "x", I'm having a harder (and slower) time finding it: I first need to query the entire top-level object, traverse the sub objects until I find one named "x" and then update its value. Then I update the entire sub-object array back into ES.
The above approach also causes concurrency issues in case I have multiple processes updating the same index. ES has optimistic locking I can use to retry when needed, or, I can queue updates and handle them serially

Is there a way to apply the synonym token filter in ElasticSearch to field names rather than the value?

Consider the following JSON file:
{
"titleSony": "Matrix",
"cast": [
{
"firstName": "Keanu",
"lastName": "Reeves"
}
]
}
Now, I know in ElasticSearch, you can apply a synonym token filter to field values as given in the following link: Elasticsearch Analysis: Synonym token filter.
Hence, I can create a "synonym.txt" file with Matrix => Matx, then if I search for titleSony:Matx, it will return the documents with Matrix as well.
Now, what I would like is to create a synonym for the field name titleSony. For example - titleSony => titleAll, such that when I search for titleAll, I should get all documents with titleSony as well.
Is there any way to accomplish this in ElasticSearch?

Now, what I would like is to create a synonym for the field name "titleSony". For example - titleSony => titleAll , hence when I search for "titleAll", I should get all documents with "titleSony" as well.
Yes, somewhat. Elasticsearch has some default behavior very similar to this, which I'll touch on in a bit.
The feature you're looking for is called "Copy to field." It allows you to specify that the terms in one field should be copied into another. This is useful for consolidating terms you expect to match into a single field, to help simplify your query when you would like to match against any one of a number of fields.
In this example, you would specify in your mapping that the terms in the titleSony field ought to be copied into the titleAll field. Presumably you'd have other fields (say, titleDisney) which also copy into that field as well. So a search against titleAll will effectively match the other fields whose terms are copied into it.
An excerpt of your mapping might look something like this:
{
"movies" : {
"properties" : {
"titleSony" : { "type" : "string", "copy_to" : "titleAll" },
"titleDisney" : { "type" : "string", "copy_to" : "titleAll" },
"titleAll" : { "type" : "string" },
"cast" : { ... },
...
}
}
I mentioned earlier that Elasticsearch does something like this. By default it creates a special field called _all into which all the document's terms are copied. This field lets you construct very simple queries to match against terms that occur in any field on the document. So as you see, this is a fairly common convention in Elasticsearch. (Elasticsearch mapping: _all field.)

Do changes to elasticsearch mapping apply to already indexed documents?

If I change the mapping so certain properties have new/different boost values, does that work even if the documents have already been indexed? Or do the boost values get applied when the document is indexed?

You cannot change field level boost factors after indexing data. It's not even possible for new data to be indexed once the same fields have been indexed already for previous data.
The only way to change the boost factor is to reindex your data. The pattern to do this without changing the code of your application is to use aliases. An alias points to a specific index. In case you want to change the index, you create a new index, then reindex data from the old index to the new index and finally you change the alias to point to the new index. Reindexing data is either supported by the elasticsearch library or can be achieved with a scan/scroll.
First version of mapping
Index: items_v1
Alias: items -> items_v1
Change necessary, sencond version of the index with new field level boost values :
Create new index: items_v2
Reindex data: items_v1 => items_v2
Change alias: items -> items_v2
This might be useful in other situations where you want to change your mapping.
Field level boosts are, however, not recommended. The better approach is to use boosting at query time.
Alias commands are:
Adding an alias
POST /_aliases
{
"actions": [
{ "add": {
"alias": "tems",
"index": "items_v1"
}}
]
}
Removing an alias
POST /_aliases
{
"actions": [
{ "remove": {
"alias": "tems",
"index": "items_v1"
}}
]
}

They do not.
Index time boosting is generally not recommended. Instead, you should do your boosting when you search.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio