Do you get the same performance using index prefixes? - performance

Say I have a collection containing documents like the one below:
{
_id: ObjectId(),
myValue: 123,
otherValue: 456
}
I then create an index like below:
{myValue: 1, otherValue: 1}
If I execute the following query:
db.myCollection.find({myValue: 123})
will I get the same performance with my index as I would if I had an index on only the myValue field? Or is the performance degraded somehow since it is using an index prefix?

A "compound index" which is the correct term for your "link" does not create any performance problems on "read" ( since writing new entries is obviously more information ) than an index just on the single field used in the query. With one exception.
If you use a "multi-Key" index which means an "array" item as part of the index then you effectively create n more items in the index per key. As in:
{ "a": 1, "b": [ 1, 2, 3 ] }
An index on { "a": 1, "b": 1 } means this in basic terms:
{ "a": 1, "b": 1 },
{ "a": 1, "b": 2 },
{ "a": 1, "b": 3 }
So basically one index entry per array element to be scanned.
But otherwise the extra field does not affect read performance, with the general exclusion of the "obvious" point that each index entry carries more data than what you "need to use" and therefore takes a little more memory to load.
So if you don't need it then don't use it. And creating "two" indexes ( one compound, one on the single field ) might save you a little memory on reads, but it will "cost" you in write performance and storage space in general.
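If you want to see the prefix being used for yourself, a quick check with explain() is enough (collection, field names and values are taken from the question; the "executionStats" verbosity needs a reasonably recent shell):

db.myCollection.createIndex({ myValue: 1, otherValue: 1 })
db.myCollection.find({ myValue: 123 }).explain("executionStats")
// The winning plan should show an IXSCAN over { myValue: 1, otherValue: 1 }:
// the query filters only on myValue, but it can still walk the index prefix,
// so a separate { myValue: 1 } index buys you nothing on reads.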

Related

Elasticsearch re-index all vs join

I'm pretty new to Elasticsearch and all its concepts. I would like to understand how I could accomplish what I have in my relational DB in an Elasticsearch architecture.
The scenario is the following:
I have an index "data":
{
"id": "00001",
"content" : "some text here ..",
"type": "T1",
"categories: ["A", "A1", "B"]
}
The requirement says that data can be queried by:
some text search in the content field
that belongs to a specific type or category
So far, so simple, so good.
This data will not be complete at creation time. It might happen that new categories are added to or removed from the data later, so many data uploads/re-indexes might happen along the way.
For example:
create the data
{
"id": "00001",
"content" : "some text here ..",
"type": "T1",
"categories: ["A"]
}
Then it was decided that all data with type=T1 must belong to both A & B categories.
{
"id": "00001",
"content" : "some text here ..",
"type": "T1",
"categories: ["A", "B"]
}
If I have a billion hits for type=T1, I would have to update/re-index a billion entries. Maybe that is how things should work, and this is where my question comes in.
Is it OK to re-index all the data just to add/remove a category, or would it be possible to have a second, much smaller index just for this association and somehow join both indexes at query time?
Something like this:
Data:
{
"id": "00001",
"content" : "some text here ..",
"type": "T1"
}
DataCategories:
{
"type": "T1"
"categories" : ["A", "B"]
}
Is it acceptable/possible?
This is a common scenario - but unfortunately, there is no 1:1 mapping for RDBMS features in text search engines like Lucene/Elasticsearch.
Possible options:
1 - For the best performance, reindex. It may not be practical depending on the velocity of your changes.
2 - Consider parent-child; though it's a slower option, it will often meet performance requirements. The category could be a parent document, each having several thousands of children.
3 - If it's category renaming, consider using IDs for the category and translating them to text in the application.
4 - Updating documents depends on the number of documents to be updated; for a few thousand, run an update query (see the sketch below); if more, reindex.
Suggested reading - https://www.elastic.co/blog/managing-relations-inside-elasticsearch
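For option 4, a minimal sketch of such an update query, assuming the index is named data, type is indexed as a keyword (so a term query matches exactly), and categories is already an array in the source:

POST data/_update_by_query
{
  "query": {
    "term": { "type": "T1" }
  },
  "script": {
    "lang": "painless",
    "source": "if (!ctx._source.categories.contains(params.cat)) { ctx._source.categories.add(params.cat) }",
    "params": { "cat": "B" }
  }
}

Keep in mind this still rewrites every matching document, so for a billion hits it is essentially a reindex in disguise - which is exactly the trade-off described in the options above.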

Elastic Ingest Pipeline split field and create a nested field

Dear friendly helpers,
I have an index that is fed by a database via Kafka. Now this database holds a field that aggregates a couple of pieces of information like so: key/value; key/value; (don't ask for the reason, I have no idea who designed it like that and why ;-) )
93/4; 34/12;
it can be empty, or it can hold 1..n key/value pairs.
I want to use an ingest pipeline and ideally have a "nested" field which holds all values that are in that field.
Probably like this:
{"categories":
{ "93": 7,
"82": 4
}
}
The use case is the following: we want to visualize the sum of a filtered number of these categories (they tell me how many minutes a specific process took longer) and relate them in ranges.
Example: I filter categories x, y, z and then group how many documents for the day had no delay, which had a delay of up to 5 minutes, and which had a delay between 5 and 15 minutes.
I have tried to get the fields neatly separated with the kv processor and wanted to work from there, but I guess it was a completely wrong approach.
"kv": {
"field": "IncomingField",
"field_split": ";",
"value_split": "/",
"target_field": "delays",
"ignore_missing": true,
"trim_key": "\\s",
"trim_value": "\\s",
"ignore_failure": true
}
When I test the pipeline it seems ok
"delays": {
"62": "3",
"86": "2"
}
but there are two things that don't work.
I can't know upfront how many of these combinations I have, and thus converting the values from string to int in the same pipeline is an issue.
When I want to create a Kibana index pattern I end up with many fields like delay.82 and delay.82.keyword, which does not make sense at all for the use case, as I can't filter (get only the sum of delays where the key is one of x, y, z) and aggregate.
I have looked into other processors (dot_expander) but can't really get my head around how to get this working.
I hope my question is clear (I lack English skills, sorry) and that someone can point me in the right direction.
Thank you very much!
You should rather structure them as an array of objects with shared accessors, for instance:
[ {key: 93, value: 7}, ...]
That way, you'll be able to aggregate on categories.key and categories.value.
So this means iterating the categories' entrySet() using a custom script processor like so:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "extracts k/v pairs",
"processors": [
{
"script": {
"source": """
def categories = ctx.categories;
def kv_pairs = new ArrayList();
for (def pair : categories.entrySet()) {
def k = pair.getKey();
def v = pair.getValue();
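// If the kv processor leaves the values as strings (as in the sample output above),
// this would also be the place to convert them, e.g. v = Integer.parseInt(v);
// (a possible tweak, not part of the original snippet).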
kv_pairs.add(["key": k, "value": v]);
}
ctx.categories = kv_pairs;
"""
}
}
]
},
"docs": [
{
"_source": {
"categories": {
"82": 4,
"93": 7
}
}
}
]
}
P.S.: Do make sure your categories field is mapped as nested b/c otherwise you'll lose the connections between the keys & the values (also called flattening).
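To illustrate that P.S., here is a minimal mapping sketch for the categories field (the index name my-index and the integer type for value are assumptions, not taken from the thread), followed by the kind of filtered sum the question describes:

PUT my-index
{
  "mappings": {
    "properties": {
      "categories": {
        "type": "nested",
        "properties": {
          "key": { "type": "keyword" },
          "value": { "type": "integer" }
        }
      }
    }
  }
}

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "delays": {
      "nested": { "path": "categories" },
      "aggs": {
        "selected": {
          "filter": { "terms": { "categories.key": ["93", "82"] } },
          "aggs": {
            "total_delay": { "sum": { "field": "categories.value" } }
          }
        }
      }
    }
  }
}

Because the mapping is nested, each key stays paired with its own value, so the filter on categories.key only sums the values that belong to the selected categories.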

Maps vs Lists in Elasticsearch for optimized query performance

I have some data I will be putting into Elasticsearch, and want to decide on a format that will optimize query performance. The query will be in words: "Is ID X in category Y?". I have a fixed number of categories (small, say, 5), and possibly a large number of IDs to put into each category (currently in the dozens, but of indeterminate size in the future). Each ID will be in at most one category (possibly none).
Format 1:
{
"field1": "value1",
...
"categories": {
"category1": ["id10", "id24", "id38",...],
...
"category5": ["id62", "id19", "id82" ...]
}
}
or
Format 2:
{
"field1": "value1",
...
"categories": {
"id1": "category4",
"id2": "category2",
"id3": "category1",
...
}
}
Which data format would be preferred? The latter format has linear lookup time, but possibly many keys.
I think format 1 is better. The number of IDs will grow in the future; if you go with format 2, you may need to disable indexing on the categories object or increase the index's field limit, since every ID becomes its own mapped field. With format 1 it is also more convenient to determine the category of a single ID (e.g. with indexOf). There are pros and cons either way; maybe there's a better approach.
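For what it's worth, with format 1 the question "Is ID X in category Y?" maps to a single term query. A sketch, assuming an index called my-index and that the ID arrays are indexed as keywords (or queried via the default .keyword sub-field):

GET my-index/_search
{
  "query": {
    "term": { "categories.category1": "id10" }
  }
}

A hit means id10 is in category1. With format 2, the same check becomes a term query on categories.id10 instead, but then every new ID adds another field to the mapping, which is what the answer above is cautioning against.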

Elasticsearch performance impact on choosing mapping structure for index

I am receiving data in a format like,
{
name:"index_name",
status: "good",
datapoints: [{
paramType: "ABC",
batch: [{
time:"timestamp1<epoch in sec>",
value: "123"
},{
time:"timestamp2<epoch in sec>",
value: "123"
}]
},
{
paramType: "XYZ",
batch: [{
time:"timestamp1<epoch in sec>",
value: "123"
},{
time:"timestamp2<epoch in sec>",
value: "124"
}]
}]
}
I would like to store the data in Elasticsearch in such a way that I can query based on a time range, status, or paramType.
As mentioned here, I can define datapoints or batch as a nested data type, which will allow indexing the objects inside the array.
Another way I can possibly think of is dividing the structure into separate documents, e.g.
{
name : "index_name",
status: "good",
paramType:"ABC",
time:"timestamp<epoch in sec>",
value: "123"
}
Which one will be the most efficient way?
If I choose the 2nd way, I know there may be ~1000 elements in the batch array and 10-15 paramTypes, which means ~15k documents will be generated and 15k*5 (= 75k) key/value pairs will be repeated in the index.
Here, the advantages and disadvantages of using nested are explained, but no performance-related stats are provided. In my case there won't be any updates to the inner objects, so I'm not sure which one will be better. Also, I have two nested objects, so I would like to know how I can query for data within a time range if I use nested.
A flat structure will perform better than nested. Nested queries are slower compared to term queries; also, while indexing, a single nested document is internally represented as a bunch of documents, they are just indexed in the same block.
As long as your requirements are met, the second option works better.
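As a rough illustration of why the flat documents are convenient to query, the "time range plus paramType plus status" case becomes a plain bool filter. A sketch, with the index name measurements and the epoch-second bounds chosen purely as placeholders:

GET measurements/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "paramType": "ABC" } },
        { "term": { "status": "good" } },
        { "range": { "time": { "gte": 1600000000, "lte": 1600003600 } } }
      ]
    }
  }
}

With the nested layout you would instead need nested queries on datapoints and datapoints.batch and then unpack inner hits to reach the individual time/value pairs.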

Very strange -- Adding a compound index makes queries much slower (MongoDB)

I'm having a problem that should be very simple but I'm stumped on this one -- maybe I'm misunderstanding something about compound indexes in MongoDB.
To reproduce this problem, I have created a simple collection with 500000 entries and six fields, each with a random number. In a mongo terminal, I generated the collection like this:
for(i = 0; i < 500000; i++){
db.test.save({a: Math.random(), b: Math.random(), c: Math.random(), d: Math.random(), e: Math.random() })
}
Then, I time a simple query on this collection like this:
t1 = new Date().getTime()
db.test.count({a : {$gt: 0.5}, b : {$gt: 0.5}, c : {$gt: 0.5}, d : {$gt: 0.5}, e : {$gt: 0.5} })
t2 = new Date().getTime()
t2-t1
=> 335ms
The query completed in 335 ms. So now I add a compound index to try to make the query faster:
db.test.ensureIndex({a: 1, b:1 ,c:1 ,d:1, e:1})
The query should be faster now, but running the exact same query takes longer:
t1 = new Date().getTime()
db.test.count({a : {$gt: 0.5}, b : {$gt: 0.5}, c : {$gt: 0.5}, d : {$gt: 0.5}, e : {$gt: 0.5} })
t2 = new Date().getTime()
t2-t1
=> 762ms
The same query takes over twice as long when the index is added! This is repeatable even when I try this multiple times. Removing the index with db.test.dropIndexes() makes the query run faster again, back to ~350ms.
Checking the queries with explain() shows that a BasicCursor is used before the index is added. After the index is added a BtreeCursor is used and has the expected indexBounds.
So my question is: why is this happening? And more importantly, how DO I get this query to run faster? In a SQL benchmark that I did on the same machine, an analogous query with SQL took ~240ms without an index, with an index dropping that down to ~180ms.
My MongoDB version info:
> mongo --version
MongoDB shell version: 2.6.3
The problem with your example here is basically that the data is indeed far "too random" to make effective use of an index in this case. The result is as expected, since there is not much "order" in how an index can traverse this, along with the consideration that, as you are indexing every field in the document, the index size will be somewhat larger than the document itself.
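One way to see this directly is with explain (shown here with the "executionStats" verbosity available in later shells; field names as in the test):

db.test.find({
  a: { $gt: 0.5 }, b: { $gt: 0.5 }, c: { $gt: 0.5 }, d: { $gt: 0.5 }, e: { $gt: 0.5 }
}).explain("executionStats")
// With uniformly random values the bounds on "a" alone already select roughly half
// of all index keys, and the remaining predicates barely narrow the B-tree scan,
// so the keys examined stay close to half the collection while far fewer documents
// are actually returned. All of that extra key-walking is the overhead you measured.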
For a better representation of a "real world" situation, you can look at a more 50/50 split of the relevant data to search for. Here is a more optimized form of generator:
var samples = [{ "a": "a", "b": "a" },{ "a": "b", "b": "b" }];
for ( var x = 0; x < 5; x++ ) {
samples.forEach(function(s) {
var batch = [];
for(i = 0; i < 10000; i++){
batch.push( s );
}
db.test.insert(batch);
});
}
That inserts the data with a fair enough representation that either search would essentially have to scan through every document in the collection to retrieve them all in the absence of an index.
So if you look at a query now with a form that gets 50% of the data:
db.test.find({ "a": 1, "b": 1 }).explain()
On the hardware where I am sitting, even warmed up, that is going to consistently take over 100ms to complete. But when you add an index to both fields:
db.test.ensureIndex({ "a": 1, "b": 1 })
Then the same query consistently completes under 100ms, and mostly around the 90ms mark. This also gets a little more interesting when you add some projection in order to force the stats to "index only":
db.test.find({ "a": 1, "b": 1 },{ "_id", "a": 1, "b": 1 }).explain()
Now while this does not need to go back to the documents in this case and is marked as "indexOnly": true, the working set size is likely small enough to fit in memory and thus you see a slight performance degradation due to the extra work "projecting" the fields. The average now with the index is around 110ms on the hardware. But when you drop the index:
db.test.dropIndexes()
The performance of the query without the use of an index drops to 170ms. This shows the overhead in projection against the benefits of the index more clearly.
Pulling the index back to the form as you had originally:
db.test.ensureIndex({ "a": 1, "b": 1, "c": 1, "d": 1, "e": 1 })
Keeping the same projection query you get around 135ms with the index and of course the same 170ms without. Now if you then go back to the original query form:
db.test.find({ "a": 1, "b": 1, "c": 1, "d":1, "e": 1}).explain()
The results with the index are still around the 135ms mark and the non-indexed query is skipping around the 185ms mark.
So it does make sense that real-world data distribution is not typically as "random" as the test you designed. Though it is also true that distribution is almost never as clear-cut as 50/50, in the general case there is not in fact so much scatter, and there tend to be natural clusters of the ranges you are looking for.
This also serves as an example that with "truly random" data with a high level of distribution between values, B-tree indexes are not the most optimal way to access the data.
I hope that makes some of the points to consider about this more clear to you.
Here is another sample closer to your original test; the only difference is altering the "precision" so the data is not so "random", which was one of the main points I was making:
var batch = []
for( i = 0; i < 500000; i++){
batch.push({
"a": Math.round(Math.random()*100)/100,
"b": Math.round(Math.random()*100)/100,
"c": Math.round(Math.random()*100)/100,
"d": Math.round(Math.random()*100)/100,
"e": Math.round(Math.random()*100)/100
});
if ( batch.length % 10000 == 0 ) {
db.test.insert( batch );
batch = [];
}
}
So there is a "two decimal place precision" in the data being enforced which again represents real world data cases more directly. Also note that the inserts are not being done on every iteration, as the implementation of insert for the shell in MongoDB 2.6 will return the "write concern" response with every update. So much faster to set up.
If you then consider your original test query, the response without an index will take around 590ms to complete as per my hardware. When you add the same index the query completes in 360ms.
If you do that on just "a" and "b" without an index:
db.test.find({ "a": {"$gt": 0.5}, "b": {"$gt": 0.5} }).explain()
The response comes in at around 490ms. Adding an index to just "a" and "b"
db.test.ensureIndex({ "a": 1, "b": 1 })
And the indexed query takes around 300ms, so still considerably faster.
Everything here says essentially:
Natural distribution is supported very well with B-tree indexes, fully random is not.
Index what you need to query on, and those fields only. There is a size cost and there is a memory cost as well.
Following from that second point, there is one more thing to demonstrate, as most examples here generally have to look up the document in the collection as well as find it in the index. The obvious cost is that both the index and the collection need to be paged into memory in order to return the results. This of course takes time.
Consider the full compound index in place with the following query; the response without the index takes around 485ms:
db.test.find({ "a": {"$gt": 0.5}, "b": {"$gt": 0.5} }).explain()
Adding the compound index on "a" through "e" makes the same query run at around 385ms with the index in place. Still faster, but slower than our full query, and there is a good reason why, considering the index contains all of the fields and conditions. But if you alter that with a projection for only the required fields:
db.test.find(
{ "a": {"$gt": 0.5}, "b": {"$gt": 0.5} },
{ "_id": 0, "a": 1, "b": 1 }
).explain()
That drops the time somewhat, and now the index is used solely to get the results. Dropping the index and issuing the same query takes around 650ms with the additional overhead of the projection. This shows that an effective index really does make a lot of difference to the results.
