I have a simple mapping in Elasticsearch 6, like this:
{
  "mappings": {
    "_doc": {
      "properties": {
        "#timestamp": {
          "type": "date"
        },
        "fields": {
          "properties": {
            "meta": {
              "properties": {
                "task": {
                  "properties": {
                    "field1": {
                      "type": "keyword"
                    },
                    "field2": {
                      "type": "keyword"
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Now I have to add another property to it: tasks, which is just an array of the task property already defined.
Is there a way to reference the properties of task so that I don't have to duplicate them all? Something like:
{
  "fields": {
    "properties": {
      "meta": {
        "properties": {
          "tasks": {
            "type": "nested",
            "properties": "fields.properties.meta.properties.task"
          },
          "task": {
            ...
          }
        }
      }
    }
  }
}
You can already use your task field as an array of task objects; you just cannot query those objects independently. If your goal is to query them independently (as I assume from your second example), I would set the nested data type directly in the mapping of the task field. Then, yes, you'll need to reindex.
I can't imagine a use case where you would need the same array of objects duplicated in two fields, one nested and the other not.
EDIT
Below, some considerations/suggestions based on the discussion in the comments:
One field can hold either one value or an array of values. In your case, the task field can hold either one task object or an array of task objects. You only need to set the nested datatype for task if you plan to query its objects independently (which, of course, only matters when there is more than one).
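For instance, once task is mapped as nested, a query that must match field1 and field2 within the same task object could look like the sketch below (the index name my_index and the values "a" and "b" are made up for illustration):

```json
GET my_index/_search
{
  "query": {
    "nested": {
      "path": "fields.meta.task",
      "query": {
        "bool": {
          "must": [
            { "term": { "fields.meta.task.field1": "a" } },
            { "term": { "fields.meta.task.field2": "b" } }
          ]
        }
      }
    }
  }
}
```

Without the nested type, the two term clauses could match across different objects of the same array.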
I would suggest designing your documents so as to avoid duplicated information in the first place. Duplicated information makes your documents bigger and more complex to process, leading to greater storage requirements and slower queries.
If it's not possible to redesign your document mapping, you might check whether alias datatypes can help you avoid some repetition.
If it's not possible to redesign your document mapping, you might check whether dynamic templates can help you avoid some repetition.
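As a rough sketch of the alias idea (the index name my_index and the alias name task_field1 are made up; aliases exist since Elasticsearch 6.4): an alias can only point at one concrete field, so it saves repetition in queries and aggregations rather than in the mapping itself.

```json
PUT my_index/_mapping/_doc
{
  "properties": {
    "task_field1": {
      "type": "alias",
      "path": "fields.meta.task.field1"
    }
  }
}
```

Queries against task_field1 are then resolved to fields.meta.task.field1 at search time.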
Related
I indexed an object with some fields, and I want to figure out how to map the index so that it handles and displays the values like Elasticsearch does. I don't know why OpenSearch splits the values into individual fields. Both apps have the same index mappings, but the display is somehow different.
I tried setting the object type to nested, but nothing changes.
PUT test
{
  "mappings": {
    "properties": {
      "szemelyek": {
        "type": "nested",
        "properties": {
          "szam": {
            "type": "integer"
          },
          "nev": {
            "type": "text"
          }
        }
      }
    }
  }
}
We have the following index schema:
PUT index_data
{
  "aliases": {},
  "mappings": {
    "properties": {
      "Id": {
        "type": "integer"
      },
      "title": {
        "type": "text"
      },
      "data": {
        "properties": {
          "FieldName": {
            "type": "text"
          },
          "FieldValue": {
            "type": "text"
          }
        },
        "type": "nested"
      }
    }
  }
}
Id is a unique identifier. Here the data field is an array, and it could have 300,000 objects or more. Is this a sensible and correct way to index this kind of data? Or should we change our design and make the schema like the following:
PUT index_data
{
  "aliases": {},
  "mappings": {
    "properties": {
      "Id": {
        "type": "integer"
      },
      "title": {
        "type": "text"
      },
      "FieldName": {
        "type": "text"
      },
      "FieldValue": {
        "type": "text"
      }
    }
  }
}
In this design, we can't use Id as the document id, because it would be repeated: if we have 300,000 FieldNames and FieldValues for one Id, that Id would be repeated 300,000 times. The challenge here is to generate a custom id using some mechanism, because we need to handle both insert and update cases.
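One common mechanism for this (sketched below with made-up values) is to derive a deterministic _id from the data itself, e.g. by concatenating Id and FieldName. Indexing the same combination again then becomes an update of the same document rather than a duplicate:

```json
PUT index_data/_doc/42_color
{
  "Id": 42,
  "title": "some title",
  "FieldName": "color",
  "FieldValue": "red"
}
```

A hash of the concatenated values works too, if the raw concatenation would be too long for an id.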
In the first approach, a single document would be too large, since it would have to contain an array of 300,000 objects, maybe more.
In the second approach, we would have too many documents: 75,370,611,530 is the number we currently have (the total count of FieldNames and FieldValues). How should we handle this kind of data? Which approach would be better? How large should the shards in this index be?
I noticed that the current mapping is not nested. I assume you would need nested, since the query seems to be "find value for key = key1".
If it is known that 300K objects are expected, it may not be a good idea: ES's soft limit on nested objects per document is 10K. In addition to possibly slow queries, indexing issues are going to cause trouble with this approach.
I doubt that indexing 75 billion documents for this purpose is useful, given the resources required, though it is feasible and would work.
Maybe consider an RDBMS?
If I want to perform a keyword search using a TermQuery, what's the proper way to do this? Am I supposed to append ".keyword" to my field name? I would think there is a more first-class way of doing it! 🤷♂️
QueryBuilders.termQuery(SOME_FIELD_NAME + ".keyword", someValue)
It all boils down to your mapping. If your field is mapped as a 'straightforward' keyword, like so:
{
  "mappings": {
    "properties": {
      "some_field": {
        "type": "keyword"
      }
    }
  }
}
you won't need to append .keyword; you'd just do
QueryBuilders.termQuery(SOME_FIELD_NAME, someValue)
It's good practice, though, not to restrict yourself to keywords only, especially if you'll be doing partial matches, expansions, autocomplete, etc. down the line.
A typical text field mapping would look like
PUT kwds
{
  "mappings": {
    "properties": {
      "some_field": {
        "type": "text",
        "fields": {
          "keyword": {          <---
            "type": "keyword"
          },
          "analyzed": {         <---
            "type": "text",
            "analyzer": "simple"
          },
          "...": {              <---
            ...
          }
        }
      }
    }
  }
}
This means you'd be able to access differently-indexed "versions" (fields) of the same "property" (field). The naming is rather confusing but you get the gist.
Long story short, this is where the .keyword convention stems from. You don't need it if your field is already mapped as a keyword.
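In REST form, the two situations look like the sketch below (index name kwds from the mapping above, the search value is made up). The first request targets a field mapped directly as keyword; the second targets the keyword sub-field of a text field:

```json
GET kwds/_search
{
  "query": {
    "term": { "some_field": "some value" }
  }
}

GET kwds/_search
{
  "query": {
    "term": { "some_field.keyword": "some value" }
  }
}
```

The Java QueryBuilders calls above generate exactly these term clauses.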
I have to upload data to ELK in the following format:
{
  "location": {
    "timestamp": 1522751098000,
    "resources": [
      {
        "resource": {
          "name": "Node1"
        },
        "probability": 0.1
      },
      {
        "resource": {
          "name": "Node2"
        },
        "probability": 0.01
      }
    ]
  }
}
I'm trying to define a mapping for this kind of data, and I produced the following mapping:
{
  "mappings": {
    "doc": {
      "properties": {
        "location": {
          "properties": {
            "timestamp": { "type": "date" },
            "resources": []
          }
        }
      }
    }
  }
}
I have 2 questions:
how can I define the "resources" array in my mapping?
is it possible to define a custom type (e.g. resource) and use this type in my mapping (e.g. "resources": [{type: resource}])?
There is a lot to know about Elasticsearch mappings. I highly suggest reading through at least some of their documentation.
Short answers first, in case you don't care:
Elasticsearch automatically allows storing one or multiple values for a defined object; there is no need to specify an array. See marker 1 below, or refer to their documentation on array types.
I don't think there is. Since Elasticsearch 6, only one type per index is allowed. Nested objects are probably the closest thing, but you define them in the same mapping. Nested objects are stored in a separate index internally.
Long answer and some thoughts
Take a look at the following mapping:
"mappings": {
"doc": {
"properties": {
"location": {
"properties": {
"timestamp": {
"type": "date"
},
"resources": { [1]
"type": "nested", [2]
"properties": {
"resource": {
"properties": {
"name": { [3]
"type": "text"
}
}
},
"probability": {
"type": "float"
}
}
}
}
}
}
}
}
This is what your mapping could look like. It can be done differently, but I think it makes sense this way, except maybe for marker 3. Let me go through the markers:
Marker 1: If you define a field, you usually give it a type. I defined resources as a nested type, while your timestamp is of type date. Elasticsearch automatically allows storing one or multiple values of these objects; timestamp could actually also contain an array of dates, and there is no need to specify an array.
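To illustrate (index name my_index and the date values are made up), the very same mapping accepts both of these documents, without ever declaring an array type:

```json
PUT my_index/doc/1
{
  "location": {
    "timestamp": "2018-04-03T10:24:58Z"
  }
}

PUT my_index/doc/2
{
  "location": {
    "timestamp": ["2018-04-03T10:24:58Z", "2018-04-04T10:24:58Z"]
  }
}
```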
Marker 2: I defined resources as a nested type, but it could also be an object like resource a little below (where no type is given). Read about nested objects here. In the end I don't know what your queries would look like, so not sure if you really need the nested type.
Marker 3: I want to address two things here. First, I want to mention again that resource is defined as a normal object with property name. You could do that for resources as well.
Second thing is more a thought-provoking impulse: Don't take it too seriously if something absolutely doesn't fit your case. Just take it as an opinion.
This mapping structure looks very much inspired by a relational database approach. I think you usually want to design document structures for Elasticsearch around the expected searches. Redundancy is not a problem, but nested objects can make your queries complicated. I think I would omit the whole resources level and do something like this:
"mappings": {
"doc": {
"properties": {
"location": {
"properties": {
"timestamp": {
"type": "date"
},
"resource": {
"properties": {
"resourceName": {
"type": "text"
}
"resourceProbability": {
"type": "float"
}
}
}
}
}
}
}
}
Because, as I said, in this case resource can contain an array of objects, each with a resourceName and a resourceProbability.
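So a document for this flatter mapping could carry the same data as your original example (values copied from it), just one level flatter:

```json
{
  "location": {
    "timestamp": 1522751098000,
    "resource": [
      { "resourceName": "Node1", "resourceProbability": 0.1 },
      { "resourceName": "Node2", "resourceProbability": 0.01 }
    ]
  }
}
```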
Suppose you have an index with documents describing vehicles.
Your index needs to deal with two different types of vehicle: motorcycles and cars.
Which of the following mappings is better from a performance point of view?
(nested is required for my purposes)
"vehicle": {
"type": "nested",
"properties": {
"car": {
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
}
}
},
"motorcycle": {
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
}
}
}
}
}
or this one:
"vehicle": {
"type": "nested",
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
},
"vehicle_type": {
"type": "string" ### "car", "motorcycle"
}
}
}
The second one is more readable and lean.
But it has the drawback that, when I write my queries, if I want to focus only on "car", I need to add that condition to the query.
If I use the first mapping, I just access the relevant field directly, without adding overhead to the query.
The first mapping, where cars and motorcycles are isolated in different fields, is more likely to be faster. The reason is that you have one less filter to apply, as you already noted, and that the queries become more selective (e.g. fewer documents match a given value of vehicle.car.model than of vehicle.model).
Another option would be to create two distinct indexes car and motorcycle, possibly with the same index template.
In Elasticsearch, a query is processed by a single thread per shard. That means that if you split your index in two and query both in a single request, the work will be executed in parallel.
So, when needed to query only one of cars or motorcycles, it's faster simply because indexes are smaller. And when it comes to query both cars and motorcycles it could also be faster by using more threads.
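Concretely (index names car and motorcycle as suggested above; the value "ducati" is made up), one request can still span both indexes when needed, by listing them in the URL:

```json
GET car,motorcycle/_search
{
  "query": {
    "nested": {
      "path": "vehicle",
      "query": {
        "match": { "vehicle.model": "ducati" }
      }
    }
  }
}
```

When only one type is needed, targeting just GET car/_search skips the other index entirely.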
EDIT: one drawback of the latter option you should know about: the inner Lucene dictionary will be duplicated, and if the values in cars and motorcycles are largely identical, it doubles the list of indexed terms.