I have a document store with multiple document types. Each document type has some basic metadata, like uuid, and a single "entity" field holding stringified JSON with the actual content. This is because the document, even though it has a type, does not have a strict schema, and any user can provide data in any structure.
I need to be able to browse, filter and search through these documents so I will be putting them into ElasticSearch.
My question is: how should I structure the ES indexes? I have read that having too many indexes is bad for ES and that it is better to have as few indexes as possible. But ES also does not like documents of the same type having different structures (mappings), and you cannot change the mapping of existing fields, only add new ones.
The "schema" is fixed for every document type and user, so I could create a new index for each user with the same type(s) in it, but as I've mentioned, having lots of indexes is bad.
So what is the recommended design in such case?
This might sound crazy, but would it be feasible to parse the document into a key/value format where the key is the property path? The only issue I see here is that everything would have to be indexed as full text, which does not sound like a good idea.
Edit: seems like ES does this on its own https://www.elastic.co/guide/en/elasticsearch/reference/current/object.html but I'm still not sure what to do.
What you could do is use an array of nested objects with key and value fields, i.e. your mapping would look like this:
"entity": {
"type": "nested",
"properties": {
"key": {
"type": "keyword"
},
"value": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
That way you can store pretty much anything you want in the entity field without risking a mapping type explosion, for instance
{
"uuid": "",
"entity": [
{"key": "myfield1", "value": "Some value"},
{"key": "myfield2", "value": "Some value"},
{"key": "myfield3", "value": "Some value"}
]
}
Then you'll have to make sure to use nested queries when querying your data but it's definitely feasible.
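As a rough sketch of how this could work on the application side, the Python below flattens an arbitrary entity into the key/value entries shown above (using the property path as the key, as suggested in the question) and builds a matching nested query. The function names are made up for illustration, and stringifying non-string values with json.dumps is an assumption, not something the mapping requires.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple


def iter_leaves(value: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (property_path, leaf_value) pairs for an arbitrary JSON value."""
    if isinstance(value, dict):
        for name, child in value.items():
            child_path = f"{path}.{name}" if path else name
            yield from iter_leaves(child, child_path)
    elif isinstance(value, list):
        # Array items share the path of the array itself.
        for child in value:
            yield from iter_leaves(child, path)
    else:
        yield path, value


def to_key_value_entries(entity: Dict[str, Any]) -> List[Dict[str, str]]:
    """Turn a schemaless entity into the [{"key": ..., "value": ...}] shape."""
    entries = []
    for key, value in iter_leaves(entity):
        # "value" is mapped as text/keyword, so non-strings are stringified.
        text = value if isinstance(value, str) else json.dumps(value)
        entries.append({"key": key, "value": text})
    return entries


def nested_term_query(key: str, value: str) -> Dict[str, Any]:
    """Build a nested query matching one entry with this exact key and value."""
    return {
        "query": {
            "nested": {
                "path": "entity",
                "query": {
                    "bool": {
                        "filter": [
                            {"term": {"entity.key": key}},
                            {"term": {"entity.value.keyword": value}},
                        ]
                    }
                },
            }
        }
    }
```

Both conditions sit inside the same nested clause, so they have to hold for the same key/value entry rather than for any two entries of the document.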
I'm new to Elasticsearch and I would like to know if there are any good practices for the use case I have.
I have heterogeneous data sent from an API that I save into a database (as JSON) and then index in Elasticsearch for search purposes. The data is sent in this format (because it's heterogeneous, users can send any type of data; some metadata can be multivalued, others single-valued, and the key names in the JSON may vary):
{
"indices":{
"MultipleIndices":[
{
"index":"editors",
"values":[
"The Editing House",
"Volcan Editing"
]
},
{
"index":"colors",
"values":[
"Red",
"Blue"
]
}
],
"SimpleIndices":[
{
"index":"AuthorName",
"value": "George R. R. Martin"
},
{
"index":"NumberOfPages",
"value":"2898"
},
{
"index":"BookType",
"value":"Fantasy"
}
]
}
}
Once we receive this JSON, it's reformatted in the code and stored as JSON in a database in this format:
{
"indices":{
"editors":[
"The Editing House",
"Volcan Editing"
],
"colors":[
"Red",
"Blue"
],
"AuthorName" : "George R. R. Martin",
"NumberOfPages" : "2898",
"BookType" : "Fantasy"
}
}
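For reference, the reformatting step described above could be sketched roughly like this in Python. The function name normalize_indices is hypothetical; the key names come from the two JSON samples.

```python
from typing import Any, Dict


def normalize_indices(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Collapse MultipleIndices/SimpleIndices into one {"indices": {...}} object."""
    source = payload.get("indices", {})
    flat: Dict[str, Any] = {}
    # Multivalued entries keep their list of values.
    for item in source.get("MultipleIndices", []):
        flat[item["index"]] = list(item["values"])
    # Single-valued entries keep a scalar value.
    for item in source.get("SimpleIndices", []):
        flat[item["index"]] = item["value"]
    return {"indices": flat}
```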
I then want to save this data into Elasticsearch. What's the best way to map it? Should I store it as JSON in one field? Will the search be efficient if I do it this way?
You must map each field individually.
You can take a look at field types to understand which type is ideal for your schema.
Another suggestion is to study text analysis, because it is responsible for structuring the text to optimize search.
My suggested mapping:
PUT indices
{
"mappings": {
"properties": {
"editors": {
"type": "keyword"
},
"colors":{
"type": "keyword"
},
"author_name":{
"type": "text"
},
"number_pages":{
"type": "integer"
},
"book_type":{
"type": "keyword"
}
}
}
}
I think in your case you don't have much choice apart from dynamic mapping, which Elasticsearch will generate for you as soon as the first document is indexed in a particular index.
However, you can improve the process by using dynamic templates to optimize your mapping; there are good examples of that in the official link I provided.
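As a minimal sketch of what a dynamic template could look like, here is an index body built as a Python dict. It assumes a typeless mapping like the one in the suggested mapping above; the template name strings_as_keywords and the helper name are only illustrative. It maps every newly seen string field as keyword instead of the default text-plus-keyword pair.

```python
import json
from typing import Any, Dict


def build_index_body(string_type: str = "keyword") -> Dict[str, Any]:
    """Index body with one dynamic template applied to every new string field."""
    return {
        "mappings": {
            "dynamic_templates": [
                {
                    # The template name is free-form; this one is an assumption.
                    "strings_as_keywords": {
                        "match_mapping_type": "string",
                        "mapping": {"type": string_type},
                    }
                }
            ]
        }
    }


# The serialized body is what would be sent when creating the index.
body_json = json.dumps(build_index_body(), indent=2)
```

Note that values such as "2898" arrive as strings, so with this template they would be mapped as keyword unless they are converted before indexing.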
I have an index that contains documents of different types (not talking about _type here) and each document has a field document_type that states their type. Is it possible to define mappings for each type of document within this index?
Is it possible to define mappings for each type of document within this index?
No, not if you plan to use the same field name with different types. For instance, a field named id cannot be mapped as both string and integer.
Having different document_type basically indicates different domains. What you could do is to group information under each respective domain or type. For instance, an employee and project, both have an id and name, but different types in this example. Some call that nesting.
An example index mapping:
PUT example
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"doc": {
"properties": {
"employee": {
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 64
}
}
}
}
},
"project": {
"properties": {
"id": {
"type": "keyword"
},
"name": {
"type": "keyword",
"ignore_above": 32
}
}
}
}
}
}
}
You can then write a document in which employee and project use different types for the same field names:
PUT example/doc/1
{
"employee": {
"id": 4711,
"name": "John Doe"
},
"project": {
"id": "Project X",
"name": "Firebrand"
}
}
Others would argue for storing employee and project in separate indices. Whether that is preferable depends on your scenario, but it is a valid approach too: it allows both domains to evolve separately from each other.
Having separate employee and project indices gives you an advantage regarding maintenance. For querying, some would argue that you can group them with an alias. In the above example that doesn't make sense, since the field types are different: a search for the name over an analysed text field behaves differently than one over a keyword field. Querying across both makes sense only if the fields have the same type.
No, if you want to use a single index, you would need to define a single mapping that combines the fields of each document type.
A better way might be to define separate indices on the same cluster for each document type. You can then create a single index alias that aliases to both of those indices if you want to be able to query across document types. Be sure that all fields that exist in both documents have the same data type in both mappings.
Having a single field name with more than one mapping type in the same index is not possible. Two options I can think of:
1. Separate the different doc types to separate indices.
2. Use different fields names for different doc types, so that each name can have different mapping. You can also use nesting, like: type_a.my_field and type_b.my_field, both in the same index.
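If you go with separate indices behind one alias, the requirement that shared fields have the same data type in both mappings can be checked up front. Here is a small sketch; the function names are made up, and it works on the "properties" part of each mapping as a plain dict.

```python
from typing import Any, Dict, List, Tuple


def field_types(properties: Dict[str, Any], prefix: str = "") -> Dict[str, str]:
    """Map each dotted field path to its declared type."""
    result: Dict[str, str] = {}
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        # Fields with sub-properties and no explicit type are plain objects.
        result[path] = spec.get("type", "object")
        if "properties" in spec:
            result.update(field_types(spec["properties"], path + "."))
    return result


def conflicting_fields(
    props_a: Dict[str, Any], props_b: Dict[str, Any]
) -> List[Tuple[str, str, str]]:
    """Return (path, type_a, type_b) for fields present in both with different types."""
    types_a = field_types(props_a)
    types_b = field_types(props_b)
    shared = sorted(set(types_a) & set(types_b))
    return [(p, types_a[p], types_b[p]) for p in shared if types_a[p] != types_b[p]]
```

For the employee/project example, both id and name would be reported as conflicts, which is why querying them through one alias is not a good fit there.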
I'm reading about mapping in Elasticsearch and I see these two terms: nested field and depth. I think the two are more or less equivalent, but I'm confused by them. Can anyone clear this up for me? Thank you.
And btw, are there any ways to check a document depth via Kibana?
Sorry for my english.
The source of confusion is probably that in Elasticsearch the term "nested" can be used in two different contexts:
"nested" as a regular JSON notation nested, i.e. JSON object within JSON object;
"nested" as Elasticsearch nested data type.
In the mappings documentation page, when they mention "depth" they refer to the first meaning. Here the setting index.mapping.depth.limit defines how deeply nested your JSON documents can be.
How is JSON depth interpreted by Elasticsearch mapping?
Here is an example of JSON document with depth 1:
{
"name": "John",
"age": 30
}
Now with depth 2:
{
"name": "John",
"age": 30,
"cars": {
"car1": "Ford",
"car2": "BMW",
"car3": "Fiat"
}
}
By default (as of ES 6.3) the depth cannot exceed 20.
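To make the counting concrete, here is a small Python sketch that computes the depth of a parsed JSON document the same way as the two examples above. Treating arrays as not adding a level of their own is an assumption on my part.

```python
from typing import Any


def json_depth(value: Any) -> int:
    """Depth of a JSON value: a flat object is 1, each inner object adds 1."""
    if isinstance(value, dict):
        return 1 + max((json_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        # Assumption: an array does not add a level, only the objects inside it do.
        return max((json_depth(v) for v in value), default=0)
    return 0
```

This runs outside Elasticsearch, on the document before you index it.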
What is a nested data type and why isn't it the same as a document with depth>1?
The nested data type allows you to index arrays of objects and query their items individually via the nested query. What this means is that Elasticsearch will index a document with such fields differently (see the page Nested Objects of the Definitive Guide for more explanation).
For instance, if in the following example we do not define "user" as a nested field in the mapping, a query for user.first: John and user.last: White will return a match, which is incorrect:
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
If we do, Elasticsearch will index each item of the "user" list as an implicit sub-document and thus will use more resources, more disk and memory. This is why there is also another setting on the mappings: index.mapping.nested_fields.limit regulates how many different nested fields one can declare (which defaults to 50). To customize this you can see this answer.
So, Elasticsearch documents with depth > 1 are not indexed as nested unless you explicitly ask it to do so, and that's the difference.
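The difference can be simulated in plain Python. This is only a model of the matching semantics, not how Elasticsearch is implemented, and the function names are made up.

```python
from typing import Any, Dict, List


def matches_flattened(objects: List[Dict[str, Any]], criteria: Dict[str, Any]) -> bool:
    """Object-array semantics: each field is checked against the pooled values of all objects."""
    return all(
        any(obj.get(field) == wanted for obj in objects)
        for field, wanted in criteria.items()
    )


def matches_nested(objects: List[Dict[str, Any]], criteria: Dict[str, Any]) -> bool:
    """Nested semantics: one single object has to satisfy every criterion."""
    return any(
        all(obj.get(field) == wanted for field, wanted in criteria.items())
        for obj in objects
    )


users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]
query = {"first": "John", "last": "White"}
flattened_hit = matches_flattened(users, query)  # True: the false positive
nested_hit = matches_nested(users, query)        # False: no single user matches
```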
Can I have nested fields inside nested?
Yes, you can define a nested field inside another nested field in a mapping. It will look something like this:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"user": {
"type": "nested",
"properties": {
"name": {
"type": "keyword"
},
"cars": {
"type": "nested",
"properties": {
"brand": {
"type": "keyword"
}
}
}
}
}
}
}
}
}
But keep in mind that the amount of implicit documents to be indexed will be multiplied, and it will be simply not that efficient.
Can I get the depth of my JSON objects from Kibana?
Most likely you can do it with scripts, check this blog post for further details: Using Painless in Kibana scripted fields.
I'm using Elasticsearch 1.7.3 and I would like to boost some fields in an index with documents like this fictional example:
{
"title": "Mickey Mouse",
"content": "Mickey Mouse is a fictional ...",
"related_articles": [
{"title": "Donald Duck"},
{"title": "Goofy"}
]
}
Here, for example, title is really important, content too, and related_articles is a bit less important. My real documents have lots of fields and nested objects.
I would like to give more weight to the title field than content, and more to content than related_articles.
I have seen the title^5 approach, but I would have to use it in every query and (I guess) list all my fields instead of querying "_all".
I have searched a lot but mostly found deprecated solutions (_boost, for example).
As I used to work with Sphinx, I'm looking for something that works like its field weight option, where you can give more weight to the fields that are really important in your index.
You're right that the _boost meta-field that you could use at the type level has been deprecated.
But you can still use the boost property when defining each field in your mapping, which will boost your field at indexing time.
Your mapping would look like this:
{
"my_type": {
"properties": {
"title": {
"type": "string", "boost": 5
},
"content": {
"type": "string", "boost": 4
},
"related_articles": {
"type": "nested",
"properties": {
"title": {
"type": "string", "boost": 3
}
}
}
}
}
}
You have to be aware, though, that it's not necessarily a good idea to boost your field at index time, because once set, you cannot change it unless you are willing to re-index all of your documents, whereas using query-time boosting achieves the same effect and can be changed more easily.
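For comparison, a query-time version could be built like this. It is only a sketch: the function name and default boosts are illustrative, and it assumes related_articles is mapped as nested as in the mapping above, which is why that part goes through a nested query.

```python
from typing import Any, Dict


def boosted_search(
    text: str, title_boost: int = 5, content_boost: int = 4, related_boost: int = 3
) -> Dict[str, Any]:
    """Search body that weights title over content over related article titles."""
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "multi_match": {
                            "query": text,
                            # The ^N suffix is the per-field query-time boost.
                            "fields": [
                                f"title^{title_boost}",
                                f"content^{content_boost}",
                            ],
                        }
                    },
                    {
                        "nested": {
                            "path": "related_articles",
                            "query": {
                                "match": {
                                    "related_articles.title": {
                                        "query": text,
                                        "boost": related_boost,
                                    }
                                }
                            },
                        }
                    },
                ]
            }
        }
    }
```

Keeping the boosts in one helper means you still only define them in one place, and you can change them without re-indexing.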
Let's say I have the following mapping:
"site": {
"properties": {
"title": { "type": "string" },
"description": { "type": "string" },
"category": { "type": "string" },
"tags": { "type": "array" },
"point": { "type": "geo_point" },
"localities": {
"type": "nested",
"properties": {
"title": { "type": "string" },
"description": { "type": "string" },
"point": { "type": "geo_point" }
}
}
}
}
I'm then doing an "_geo_distance" sort on the parent document and am able to sort the documents on "site.point". However I would also like the nested localities to be sorted by "_geo_distance", inside the parent document.
Is this possible? If so, how?
Unfortunately, no (at least not yet).
A query in ElasticSearch just identifies which documents match the query, and how well they match.
To understand what nested documents are useful for, consider this example:
{
"title": "My post",
"body": "Text in my body...",
"followers": [
{
"name": "Joe",
"status": "active"
},
{
"name": "Mary",
"status": "pending"
}
]
}
The above JSON, once indexed in ES, is functionally equivalent to the following. Note how the followers field has been flattened:
{
"title": "My post",
"body": "Text in my body...",
"followers.name": ["Joe","Mary"],
"followers.status": ["active","pending"]
}
A search for: followers with status == active and name == Mary would match this document... incorrectly.
Nested fields allow us to work around this limitation. If the followers field is declared to be of type nested instead of type object then its contents are created as a separate (invisible) sub-document internally. That means that we can use a nested query or nested filter to query these nested documents as individual docs.
However, the output from the nested query/filter clauses only tells us if the main doc matches, and how well it matches. It doesn't even tell us which of the nested docs matched. To figure that out, we'd have to write code in our application to check each of the nested docs against our search criteria.
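That application-side check could be as simple as the sketch below. The function name is hypothetical and it only handles exact-value criteria, so it is a starting point rather than a general solution.

```python
from typing import Any, Dict, List


def matching_subdocs(
    doc: Dict[str, Any], field: str, criteria: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Return the sub-documents under `field` that satisfy every criterion."""
    return [
        sub
        for sub in doc.get(field, [])
        if all(sub.get(key) == wanted for key, wanted in criteria.items())
    ]


post = {
    "title": "My post",
    "body": "Text in my body...",
    "followers": [
        {"name": "Joe", "status": "active"},
        {"name": "Mary", "status": "pending"},
    ],
}
active = matching_subdocs(post, "followers", {"status": "active"})
```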
There are a few open issues requesting the addition of these features, but it is not an easy problem to solve.
The only way to achieve what you want is to index your sub-docs as separate documents, and to query and sort them independently. It may be useful to establish a parent-child relationship between the main doc and these separate sub-docs (see parent-type mapping, the Parent & Child section of the index api docs, and the top-children and has-child queries).
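As a rough idea of the restructuring, here is how a site document could be split before indexing, so that each locality becomes its own document and can be sorted by "_geo_distance" on its own point. The site_id back-reference field and the function name are assumptions for illustration; how you wire up the actual parent-child relationship is up to the mapping.

```python
from typing import Any, Dict, List, Tuple


def split_site(
    site_id: str, site: Dict[str, Any]
) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Split a site into a parent doc and one standalone doc per locality."""
    # The parent keeps everything except the embedded localities.
    parent = {key: value for key, value in site.items() if key != "localities"}
    # Each locality carries a (hypothetical) site_id pointing back to its parent.
    children = [dict(locality, site_id=site_id) for locality in site.get("localities", [])]
    return parent, children
```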
Also, an ES user has mailed the list about a new has_parent filter that they are currently working on in a fork. However, this is not available in the main ES repo yet.