Suppose you have an index with documents describing vehicles.
The index needs to deal with two different types of vehicles: motorcycles and cars.
Which of the following mappings is better from a performance point of view?
(nested is required for my purposes)
"vehicle": {
"type": "nested",
"properties": {
"car": {
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
}
}
},
"motorcycle": {
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
}
}
}
}
}
or this one:
"vehicle": {
"type": "nested",
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
},
"vehicle_type": {
"type": "string" ### "car", "motorcycle"
}
}
}
The second one is more readable and compact.
But its drawback is that when I build my queries, if I want to focus only on "car", I need to add that condition to the query.
If I use the first mapping, I can access the relevant field directly, without adding any overhead to the query.
The first mapping, where cars and motorcycles are isolated in different fields, is more likely to be faster. The reason is that you have one less filter to apply, as you already noted, and that the queries are more selective (e.g. fewer documents match a given value of vehicle.car.model than of vehicle.model).
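To illustrate the difference, here is a sketch of the two queries (the index name "vehicles" and the value "corolla" are made up for the example). With the second mapping you need an extra term filter on vehicle_type inside the nested query:
POST vehicles/_search
{
  "query": {
    "nested": {
      "path": "vehicle",
      "query": {
        "bool": {
          "filter": [
            { "term": { "vehicle.vehicle_type": "car" } },
            { "match": { "vehicle.model": "corolla" } }
          ]
        }
      }
    }
  }
}
whereas with the first mapping you target the car-specific field directly:
POST vehicles/_search
{
  "query": {
    "nested": {
      "path": "vehicle",
      "query": {
        "match": { "vehicle.car.model": "corolla" }
      }
    }
  }
}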
Another option would be to create two distinct indexes, car and motorcycle, possibly sharing the same index template.
In Elasticsearch, a query is processed by a single thread per shard. That means that if you split your index in two and query both in a single request, they will be searched in parallel.
So when you need to query only cars or only motorcycles, it's faster simply because each index is smaller. And when you need to query both cars and motorcycles, it could also be faster because more threads are used.
EDIT: one drawback of the latter option you should know about: the underlying Lucene term dictionary will be duplicated, and if the values in cars and motorcycles are largely identical, it doubles the list of indexed terms.
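If you go the two-index route, a shared index template keeps the common mapping in one place. A minimal sketch (the template name is made up, and the syntax shown is for recent Elasticsearch versions; older releases use "template" instead of "index_patterns"):
PUT _template/vehicle_template
{
  "index_patterns": ["car*", "motorcycle*"],
  "mappings": {
    "properties": {
      "model": { "type": "keyword" },
      "cost": { "type": "integer" }
    }
  }
}
Any index whose name matches one of the patterns (e.g. car or motorcycle) will then be created with this mapping automatically.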
Related
We have a following index schema:
PUT index_data
{
"aliases": {},
"mappings": {
"properties": {
"Id": {
"type": "integer"
},
"title": {
"type": "text"
},
"data": {
"properties": {
"FieldName": {
"type": "text"
},
"FieldValue": {
"type": "text"
}
},
"type": "nested"
}
}
}
}
Id is a unique identifier. Here the data field is an array, and it could have 300,000 objects or more. Is this a sensible and correct way to index this kind of data? Or should we change our design and make the schema like the following:
PUT index_data
{
"aliases": {},
"mappings": {
"properties": {
"Id": {
"type": "integer"
},
"title": {
"type": "text"
},
"FieldName": {
"type": "text"
},
"FieldValue": {
"type": "text"
}
}
}
}
In this design, we can't use Id as the document id because it would be repeated: if we have 300,000 FieldName/FieldValue pairs for one Id, that Id would be repeated 300,000 times. The challenge here is to generate our own custom id by some mechanism, because we need to handle both insert and update cases.
In the first approach, a single document would be very large, since it would contain an array of 300,000 objects or more.
In the second approach, we would have too many documents: 75,370,611,530 is the number we currently have (the total number of FieldName/FieldValue pairs). How should we handle this kind of data? Which approach would be better? What should the shard size be for this index?
I noticed that the current mapping is not nested. I assume you would need it to be nested, as the query seems to be "find value for key = key1".
If it is known that around 300K objects are expected, it may not be a good idea: the Elasticsearch soft limit on nested objects per document is 10K. Indexing is going to give you trouble with this approach, in addition to possibly slow queries.
I doubt that indexing 75 billion documents for this purpose is useful, given the resources required, though it is feasible and will work.
Maybe consider an RDBMS?
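For reference, the 10K soft limit mentioned above corresponds (in recent Elasticsearch versions) to the index.mapping.nested_objects.limit index setting. It can be set when the index is created, as in the sketch below, which reuses the nested schema from the question; raising it to 300K is technically possible but would amplify the indexing and memory problems described above:
PUT index_data
{
  "settings": {
    "index.mapping.nested_objects.limit": 10000
  },
  "mappings": {
    "properties": {
      "Id": { "type": "integer" },
      "title": { "type": "text" },
      "data": {
        "type": "nested",
        "properties": {
          "FieldName": { "type": "text" },
          "FieldValue": { "type": "text" }
        }
      }
    }
  }
}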
I have a dataset of about 20 million records, with the following structure:
{"id": "123",
"cites":[
{"id":"234", "date":"2018-05-04"},
{"id":"456","date":"2018-02-01"}]
}
and I would like to build an index where I can see the list of the ids that cite an article, something like
{"id":"234", "cited_by":[{"id":"123"},{"id":"188"}]}
which I understand is technically an inverted index. This can be static, so it could be computed just once. I've only seen documentation about inverted indices used for terms and their frequencies in text, which is a very different use case.
I looked into using aggregations, but because the number of different ids is too large it runs out of buckets, and I am not sure 20 million buckets are possible and/or a good idea.
How could I generate this index? Is it possible in ElasticSearch, or would I need to write an external script that does this in batches?
Thank you so much!
No problem using Elasticsearch for your case.
Script to create the index:
PUT /city_index
{
"mappings": {
"citydata": {
"dynamic": "false",
"properties": {
"id": {
"type": "keyword"
},
"cited_by": {
"type": "object",
"properties": {
"id": {
"type": "keyword"
}
}
}
}
}
}
}
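Once the mapping is in place, each article's list of citing ids becomes one small document. A sketch using the sample data from the question (in practice the cited_by lists would be computed by an external batch job and pushed with the bulk API):
PUT /city_index/citydata/234
{
  "id": "234",
  "cited_by": [
    { "id": "123" },
    { "id": "188" }
  ]
}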
I have a simple mapping in elasticsearch-6, like this.
{
"mappings": {
"_doc": {
"properties": {
"#timestamp": {
"type": "date"
},
"fields": {
"properties": {
"meta": {
"properties": {
"task": {
"properties": {
"field1": {
"type": "keyword"
},
"field2": {
"type": "keyword"
}
}
}
}
}
}
}
}
}
}
}
Now I have to add another property to it, tasks, which is just an array of the task property already defined.
Is there a way to reference the properties of task so that I don't have to duplicate all the properties? Something like:
{
"fields": {
"properties": {
"meta": {
"properties": {
"tasks": {
"type": "nested",
"properties": "fields.properties.meta.properties.task"
},
"task": {
...
}
}
}
}
}
}
You can already use your task field as an array of task objects; the only thing you cannot do is query those objects independently. If your goal is to achieve that (as I assume from your second example), I would set the "nested" datatype directly on the mapping of the task field - and then, yes, you'll need to reindex.
I can't imagine a use case where you would need the same array of objects duplicated in two fields, with one nested and the other not.
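For clarity, setting the "nested" datatype on the existing task field would look roughly like this (a sketch of the relevant part of the mapping only; you would create a new index with it and reindex into it):
{
  "mappings": {
    "_doc": {
      "properties": {
        "fields": {
          "properties": {
            "meta": {
              "properties": {
                "task": {
                  "type": "nested",
                  "properties": {
                    "field1": { "type": "keyword" },
                    "field2": { "type": "keyword" }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}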
EDIT
Below, some considerations/suggestions based on the discussion in the comments:
A field can hold either a single value or an array of values. In your case, the task field can hold either one task object or an array of task objects. You only need to set the "nested" datatype on task if you plan to query its objects independently (and, of course, only if there can be more than one).
I would suggest designing your documents in such a way as to avoid duplicated information in the first place. Duplicated information makes your documents bigger and more expensive to process, leading to greater storage requirements and slower queries.
If it's not possible to redesign your document mapping, you might check whether alias datatypes can help you avoid some repetition.
You might also check whether dynamic templates can help you avoid some repetition, as in the sketch below.
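For example, a dynamic template can map every leaf field under both task and tasks as keyword without repeating the field definitions (a sketch with a made-up template name; the path pattern follows the mapping above and applies only to dynamically added fields):
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "task_strings_as_keyword": {
            "path_match": "fields.meta.task*.*",
            "match_mapping_type": "string",
            "mapping": { "type": "keyword" }
          }
        }
      ],
      "properties": {
        "#timestamp": { "type": "date" },
        "fields": {
          "properties": {
            "meta": {
              "properties": {
                "tasks": { "type": "nested" }
              }
            }
          }
        }
      }
    }
  }
}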
I have a very large volume of documents in ElasticSearch (5.5) which hold recorded data at regular time intervals, let's say every 3 seconds.
{
"#timestamp": "2015-10-14T12:45:00Z",
"channel1": 24.4
},
{
"#timestamp": "2015-10-14T12:48:00Z",
"channel1": 25.5
},
{
"#timestamp": "2015-10-14T12:51:00Z",
"channel1": 26.6
}
Let's say that I need to get results back for a query that asks for the point value every 5 seconds. An interference pattern arises where sometimes there will be an exact match (for simplicity's sake, let's say in the example above that 12:45 is the only sample to land on a multiple of five).
At those times, I want Elasticsearch to give me the exact value recorded at that time, if there is one. So at 12:45 there is a match, and it returns the value 24.4.
In the other cases, I need the last (previously recorded) value. So at 12:50, having no data at that precise time, it would return the value at 12:48 (25.5), that being the last known value.
Previously I have used aggregations, but they don't help here because I don't want some average computed over a bucket of data; I need either the exact value for an exact time match, or the previous value if there is no match.
I could do this programmatically, but performance is a real issue here, so I need the most performant way possible to retrieve the data as described. Returning ALL the data from Elasticsearch, iterating over the results, and checking for a match at each time interval (otherwise keeping the item at index i-1) sounds slow, and I suspect it isn't the best way.
Perhaps I am missing a trick with Elasticsearch. Perhaps somebody knows a method to do exactly what I am after? It would be much appreciated.
The mapping is like so:
"mappings": {
"sampleData": {
"dynamic": "true",
"dynamic_templates": [{
"pv_values_template": {
"match": "GroupId", "mapping": { "doc_values": true, "store": false, "type": "keyword" }
}
}],
"properties": {
"#timestamp": { "type": "date" },
"channel1": { "type": "float" },
"channel2": { "type": "float" },
"item": { "type": "object" },
"keys": { "properties": { "count": { "type": "integer" }}},
"values": { "properties": { "count": { "type": "integer" }}}
}
}
}
and the (NEST) method being called looks like so:
channelAggregation => channelAggregation.DateHistogram("HistogramFilter", histogram => histogram
.Field(dataRecord => dataRecord["#timestamp"])
.Interval(interval)
.MinimumDocumentCount(0)
.ExtendedBounds(start, end)
.Aggregations(aggregation => DataFieldAggregation(channelNames, aggregation)));
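For readers not using NEST, the call above corresponds roughly to the following raw date_histogram aggregation (a sketch; the index name, interval and bounds are placeholders taken from the example data):
POST sample_data/_search
{
  "size": 0,
  "aggs": {
    "HistogramFilter": {
      "date_histogram": {
        "field": "#timestamp",
        "interval": "5s",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2015-10-14T12:45:00Z",
          "max": "2015-10-14T12:51:00Z"
        }
      }
    }
  }
}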
@Nikolay there may be up to around 1,400 buckets (a maximum of one value to be returned per pixel available on the chart)
I'm facing a performance problem. My application is a chat application.
I designed the index mapping with nested objects, as below.
{
"conversation_id-v1": {
"mappings": {
"stream": {
"properties": {
"id": {
"type": "keyword"
},
"message": {
"type": "text",
"fields": {
"analyzerName": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "analyzerName"
},
"language": {
"type": "langdetect",
"analyzer": "_keyword",
"languages": ["en", "ko", "ja"]
}
}
},
"comments": {
"type": "nested",
"properties": {
"id": {
"type": "keyword"
},
"message": {
"type": "text",
"fields": {
"analyzerName": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "analyzerName"
},
"language": {
"type": "langdetect",
"analyzer": "_keyword",
"languages": ["en", "ko", "ja"]
}
}
}
}
}
}
}
}
}
}
** there are actually a lot more fields
A document has around 4,000 nested objects. When I upsert data into a document, the CPU peaks at 100%, and so does disk I/O during writes. The input rate is around 1,000/s.
How can I tune this to improve performance?
Hardware
3x 2vCPUs 13GB on GCP
4,000 nested objects sounds like a lot - if I were you, I would look long and hard at your mapping design to be very certain you actually need that many nested objects.
Quoting from the docs:
Internally, nested objects index each object in the array as a separate hidden document.
Since a document has to be fully reindexed on update, you're indexing 4000 documents with a single update.
Why so many fields?
The reason you gave in the comments for needing so many fields
I'd like to search comments in nested and come with their parent stream for display.
makes me think that you may be mixing two concerns here.
ElasticSearch is meant for search, and your mapping should be optimized for search. If your mapping shape is dictated by the way you want to display information, then something is wrong.
Design your index around search
Note that by "search" I mean both indexing and querying.
For the use case you have, it seems like you could:
Index only the comments, with a reference (some id) to the parent stream in the indexed comment document.
After you get the search results (a list of comments) back from the search index, you can retrieve each comment along with its parent stream from some other data source (e.g. a relational database).
The point is, it may be much more efficient to re-retrieve the comment along with whatever else you want from some other source that is better than Elasticsearch at joining data.
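As a rough illustration of that approach (a sketch only; the index name, the type name and the stream_id field are made up, and the custom analyzers and language sub-fields from the original mapping would be added back in the same way), the search index could hold flat comment documents instead of one huge nested document per conversation:
PUT comments_index
{
  "mappings": {
    "comment": {
      "properties": {
        "id": { "type": "keyword" },
        "stream_id": { "type": "keyword" },
        "message": { "type": "text" }
      }
    }
  }
}
Each comment is then a small document of its own, so an update touches only that document instead of reindexing the whole conversation with its 4,000 nested objects.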