Very strange -- Adding a compound index makes queries much slower (MongoDB) - performance

I'm having a problem that should be very simple but I'm stumped on this one -- maybe I'm misunderstanding something about compound indexes in MongoDB.
To reproduce this problem, I have created a simple collection with 500000 entries and five fields, each holding a random number. In a mongo shell, I generated the collection like this:
for (i = 0; i < 500000; i++) {
    db.test.save({ a: Math.random(), b: Math.random(), c: Math.random(), d: Math.random(), e: Math.random() })
}
Then, I time a simple query on this collection like this:
t1 = new Date().getTime()
db.test.count({a : {$gt: 0.5}, b : {$gt: 0.5}, c : {$gt: 0.5}, d : {$gt: 0.5}, e : {$gt: 0.5} })
t2 = new Date().getTime()
t2-t1
=> 335ms
The query completed in 335 ms. So now I add a compound index to try to make the query faster:
db.test.ensureIndex({a: 1, b:1 ,c:1 ,d:1, e:1})
The query should be faster now, but running the exact same query takes longer:
t1 = new Date().getTime()
db.test.count({a : {$gt: 0.5}, b : {$gt: 0.5}, c : {$gt: 0.5}, d : {$gt: 0.5}, e : {$gt: 0.5} })
t2 = new Date().getTime()
t2-t1
=> 762ms
The same query takes over twice as long with the index added! This is repeatable across multiple runs. Removing the index with db.test.dropIndexes() makes the query run faster again, back to ~350ms.
Checking the queries with explain() shows that a BasicCursor is used before the index is added. After the index is added, a BtreeCursor is used with the expected indexBounds.
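For reference, the cursor type can be checked with a plain explain() on the same predicate (a sketch; the exact plan output will vary with your data):
// without the index this reports "cursor" : "BasicCursor";
// with the compound index it reports "cursor" : "BtreeCursor a_1_b_1_c_1_d_1_e_1" plus the indexBounds
db.test.find({a: {$gt: 0.5}, b: {$gt: 0.5}, c: {$gt: 0.5}, d: {$gt: 0.5}, e: {$gt: 0.5}}).explain()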
So my question is: why is this happening? And more importantly, how DO I get this query to run faster? In a SQL benchmark I did on the same machine, an analogous query took ~240ms without an index, and adding an index dropped that to ~180ms.
My MongoDB version info:
> mongo --version
MongoDB shell version: 2.6.3

The problem with your example here is basically that the data is indeed far "too random" to make effective use of an index in this case. The result is as expected, since there is not much "order" for an index to exploit when traversing it, along with the consideration that, as you are indexing every field in the document, the index will be somewhat larger than the documents themselves.
For a better representation of a "real world" situation, you can look at a more 50/50 split of the relevant data to search for. Here is a more optimized form of generator:
var samples = [{ "a": "a", "b": "a" }, { "a": "b", "b": "b" }];
for (var x = 0; x < 5; x++) {
    samples.forEach(function(s) {
        var batch = [];
        for (i = 0; i < 10000; i++) {
            batch.push(s);
        }
        db.test.insert(batch);
    });
}
That inserts the data in a fair enough distribution that, in the absence of an index, either search would essentially have to scan every document in the collection to be certain of retrieving them all.
So now look at a query that fetches 50% of the data:
db.test.find({ "a": 1, "b": 1 }).explain()
On my hardware, even warmed up, that consistently takes over 100ms to complete. But when you add an index on both fields:
db.test.ensureIndex({ "a": 1, "b": 1 })
Then the same query consistently completes under 100ms, and mostly around the 90ms mark. This also gets a little more interesting when you add some projection in order to force the stats to "index only":
db.test.find({ "a": 1, "b": 1 },{ "_id", "a": 1, "b": 1 }).explain()
Now while this does not need to go back to the documents and is marked as "indexOnly": true, the working set is likely small enough to fit in memory, so you see a slight performance degradation from the extra work of "projecting" the fields. With the index, the average is now around 110ms on this hardware. But when you drop the index:
db.test.dropIndexes()
The same query without an index now takes 170ms, which shows the overhead of the projection against the benefit of the index more clearly.
Putting the index back in the form you had originally:
db.test.ensureIndex({ "a": 1, "b": 1, "c": 1, "d": 1, "e": 1 })
Keeping the same projection query, you get around 135ms with the index and of course the same 170ms without. Now if you go back to the original query form:
db.test.find({ "a": 1, "b": 1, "c": 1, "d":1, "e": 1}).explain()
The results with the index are still around the 135ms mark, and the non-indexed query hovers around the 185ms mark.
So the point is that real-world data distribution is typically not as "random" as the test you designed. It is also true that distribution is almost never as clear-cut as 50/50, but in the general case there is not in fact so much scatter, and there tend to be natural clusters in the ranges you are looking for.
This also serves as an example that with "truly random" data, where values are highly dispersed, B-tree indexes are not the most optimal way to access the data.
I hope that makes the points to consider here clearer.
Here is another sample closer to your original test; the only difference is altering the "precision" so the data is not so "random", which was one of the main points I was making:
var batch = [];
for (i = 0; i < 500000; i++) {
    batch.push({
        "a": Math.round(Math.random() * 100) / 100,
        "b": Math.round(Math.random() * 100) / 100,
        "c": Math.round(Math.random() * 100) / 100,
        "d": Math.round(Math.random() * 100) / 100,
        "e": Math.round(Math.random() * 100) / 100
    });
    if (batch.length % 10000 == 0) {
        db.test.insert(batch);
        batch = [];
    }
}
So a "two decimal place" precision is enforced in the data, which again represents real-world cases more directly. Also note that the inserts are batched rather than issued on every iteration: the shell implementation of insert in MongoDB 2.6 returns the "write concern" response with every write, so batching is much faster to set up.
If you then consider your original test query, the response without an index takes around 590ms to complete on my hardware. When you add the same index, the query completes in 360ms.
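For reference, the commands being timed here are the same ones from the original question, just re-run against this less random data:
// same compound index and the same count as the original test
db.test.ensureIndex({a: 1, b: 1, c: 1, d: 1, e: 1})
db.test.count({a: {$gt: 0.5}, b: {$gt: 0.5}, c: {$gt: 0.5}, d: {$gt: 0.5}, e: {$gt: 0.5}})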
If you do that on just "a" and "b" without an index:
db.test.find({ "a": {"$gt": 0.5}, "b": {"$gt": 0.5} }).explain()
The response comes in at around 490ms. Adding an index on just "a" and "b":
db.test.ensureIndex({ "a": 1, "b": 1 })
And the indexed query takes around 300ms, so still considerably faster.
Everything here says essentially:
Natural distribution is supported very well with B-tree indexes; fully random data is not.
Index what you need to query on, and those fields only. There is a size cost and there is a memory cost as well.
From that second point there is one more thing to demonstrate, since most of the examples here have to look up the document in the collection as well as find it in the index. The obvious cost is that both the index and the collection need to be paged into memory in order to return the results, and that of course takes time.
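As a rough way to see that size and memory cost for yourself, the collection statistics put the data size next to the index sizes (a sketch; the fields shown are standard collStats output and the numbers will depend entirely on your data):
// compare the size of the documents with the size of the indexes
var stats = db.test.stats()
printjson({
    dataSize: stats.size,                    // size of the documents in bytes
    totalIndexSize: stats.totalIndexSize,    // all indexes combined
    indexSizes: stats.indexSizes             // per-index breakdown
})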
With that in mind, consider the following query; without an index in place the response takes around 485ms:
db.test.find({ "a": {"$gt": 0.5}, "b": {"$gt": 0.5} }).explain()
Adding the compound index on "a" through "e" makes the same query run in around 385ms. Still faster, but slower than the full five-field query, and there is a good reason why, considering the index contains all of the fields while the conditions only cover two of them. But if you alter that with a projection for only the required fields:
db.test.find(
{ "a": {"$gt": 0.5}, "b": {"$gt": 0.5} },
{ "_id": 0, "a": 1, "b": 1 }
).explain()
That drops the time somewhat, and now the index is used solely to get the results. Dropping the index and issuing the same query takes around 650ms, with the additional overhead of the projection. This shows that an effective index really does make a lot of difference to the results.


Binning Data With Two Timestamps

I'm posting because I have found no content surrounding this topic.
My goal is essentially to produce a time-binned graph that plots some aggregated value. Usually this would be a doddle, since there is a single timestamp for each value, making it relatively straightforward to bin.
However, my problem lies in having two timestamps for each value, a start and an end, similar to a Gantt chart. I essentially want to bin the values (averaged) over the bins that each timeline overlaps (bin boundaries could be where a new/old task starts/ends).
I'm looking for a basic example, or an answer to whether this is even supported in Vega-Lite. My current working example would yield no benefit to this discussion.
I see that you found a Vega solution, but I think what you were looking for in Vega-Lite was something like the following: put the start field in "x" and the end field in "x2", add bin and type to "x", and all should work.
"encoding": {
"x": {
"field": "start_time",
"bin": { "binned": true },
"type": "temporal",
"title": "Time"
},
"x2": {
"field": "end_time"
}
}
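For context, a minimal standalone spec around that encoding might look like the following sketch; the bar mark, the inline sample data, and the start_time/end_time field names are assumptions, not something from your setup:
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "description": "Illustrative sketch only; data and field names are placeholders",
  "data": {
    "values": [
      {"task": "A", "start_time": "2021-01-01T00:00:00", "end_time": "2021-01-01T02:00:00"},
      {"task": "B", "start_time": "2021-01-01T01:00:00", "end_time": "2021-01-01T03:00:00"}
    ]
  },
  "mark": "bar",
  "encoding": {
    "y": {"field": "task", "type": "nominal"},
    "x": {"field": "start_time", "bin": {"binned": true}, "type": "temporal", "title": "Time"},
    "x2": {"field": "end_time"}
  }
}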
I lost my old account, but I was the person who posted this. Here is my solution to my question. The value I am aggregating here is the total time each datapoint's timeline is contained within each bin.
First you want to use a joinaggregate transform to get the min and max times your data extends to. You could also hardcode this.
{
  "type": "joinaggregate",
  "fields": ["startTime", "endTime"],
  "ops": ["min", "max"],
  "as": ["min", "max"]
}
You then want to find a step size for your bins; you can hardcode this or compute it with a formula and write it into a new field, as sketched just below.
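A sketch of such a step formula, assuming you want roughly 20 bins (the divisor of 20 is an arbitrary choice; min and max come from the joinaggregate above):
{
  "type": "formula",
  "expr": "(datum.max - datum.min) / 20",
  "as": "step"
}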
Next you want to create two new fields in your data: one a sequence between the min and max, and the other the same sequence offset by your step.
{
  "type": "formula",
  "expr": "sequence(datum.min, datum.max, datum.step)",
  "as": "startBin"
}
{
  "type": "formula",
  "expr": "sequence(datum.min + datum.step, datum.max + datum.step, datum.step)",
  "as": "endBin"
}
The new fields will be arrays. So if we go ahead and use a flatten transform we will get a row for each data value in each bin.
{
  "type": "flatten",
  "fields": ["startBin", "endBin"]
}
You then want to calculate the total time your data spans within each specific bin. To do this, clamp the start time up to the bin start and the end time down to the bin end, then take the difference between the two.
{
  "type": "formula",
  "expr": "if(datum.startTime < datum.startBin, datum.startBin, if(datum.startTime > datum.endBin, datum.endBin, datum.startTime))",
  "as": "startBinTime"
}
{
  "type": "formula",
  "expr": "if(datum.endTime < datum.startBin, datum.startBin, if(datum.endTime > datum.endBin, datum.endBin, datum.endTime))",
  "as": "endBinTime"
}
{
  "type": "formula",
  "expr": "datum.endBinTime - datum.startBinTime",
  "as": "timeInBin"
}
Finally, you just need to aggregate the data by the bins and sum up these times. Then your data is ready to be plotted.
{
  "type": "aggregate",
  "groupby": ["startBin", "endBin"],
  "fields": ["timeInBin"],
  "ops": ["sum"],
  "as": ["timeInBin"]
}
Although this solution is long, it is relatively easy to implement in the transform section of your data. In my experience it runs fast and just shows how versatile Vega can be. Freedom to visualisations!

Elastic Ingest Pipeline split field and create a nested field

Dear friendly helpers,
I have an index that is fed by a database via Kafka. Now this database holds a field that aggregates a couple of pieces of information as key/value pairs, like so: key/value; key/value; (don't ask for the reason, I have no idea who designed it like that and why ;-) )
93/4; 34/12;
It can be empty, or it can hold 1..n key/value pairs.
I want to use an ingest pipeline and ideally have a "nested" field which holds all values that are in that field.
Probably like this:
{"categories":
{ "93": 7,
"82": 4
}
}
The use case is the following: we want to visualize the sum of a filtered set of these categories (they tell me by how many minutes a specific process was delayed) and relate them in ranges.
Example: I filter categories x, y, z and then group how many documents for the day had no delay, how many had a delay of up to 5 minutes, and how many had a delay between 5 and 15 minutes.
I have tried to get the fields neatly separated with the kv processor and wanted to work from there, but I guess that was the wrong approach.
"kv": {
"field": "IncomingField",
"field_split": ";",
"value_split": "/",
"target_field": "delays",
"ignore_missing": true,
"trim_key": "\\s",
"trim_value": "\\s",
"ignore_failure": true
}
When I test the pipeline it seems OK:
"delays": {
"62": "3",
"86": "2"
}
But there are two things that don't work.
I can't know upfront how many of these combinations I will have, so converting the values from string to int in the same pipeline is an issue.
When I want to create a Kibana index pattern I end up with many fields like delay.82 and delay.82.keyword, which does not make sense at all for the use case, as I can't filter (get only the sum of delays where the key is one of x, y, z) and aggregate.
I have looked into other processors (dot_expander) but can't really get my head around how to get this working.
I hope my question is clear (I lack English skills, sorry) and that someone can point me in the right direction.
Thank you very much!
You should rather structure them as an array of objects with shared accessors, for instance:
[ {key: 93, value: 7}, ...]
That way, you'll be able to aggregate on categories.key and categories.value.
So this means iterating the categories' entrySet() using a custom script processor like so:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "extracts k/v pairs",
    "processors": [
      {
        "script": {
          "source": """
            def categories = ctx.categories;
            def kv_pairs = new ArrayList();
            for (def pair : categories.entrySet()) {
              def k = pair.getKey();
              def v = pair.getValue();
              kv_pairs.add(["key": k, "value": v]);
            }
            ctx.categories = kv_pairs;
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "categories": {
          "82": 4,
          "93": 7
        }
      }
    }
  ]
}
P.S.: Do make sure your categories field is mapped as nested b/c otherwise you'll lose the connections between the keys & the values (also called flattening).
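To illustrate that, here is a sketch of what the nested mapping and a filtered sum could look like; the index name delays, the sample keys, and the integer type are assumptions for illustration only:
# hypothetical mapping: categories as nested key/value objects
PUT delays
{
  "mappings": {
    "properties": {
      "categories": {
        "type": "nested",
        "properties": {
          "key":   { "type": "keyword" },
          "value": { "type": "integer" }
        }
      }
    }
  }
}
# sum of values, restricted to a chosen set of keys
GET delays/_search
{
  "size": 0,
  "aggs": {
    "cats": {
      "nested": { "path": "categories" },
      "aggs": {
        "selected": {
          "filter": { "terms": { "categories.key": [ "93", "82" ] } },
          "aggs": {
            "total_delay": { "sum": { "field": "categories.value" } }
          }
        }
      }
    }
  }
}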

restructure elasticsearch index to allow filtering on sum of values

I've an index of products.
Each product has several variants (can be a few or hundreds; each has a color & size, e.g. Red).
Each variant is available (in a certain quantity) at several warehouses (around 100 warehouses).
Warehouses have codes, e.g. AB, XY, CD, etc.
If I had my choice, I'd index it as:
stock: {
  Red: {
    S: { AB: 100, XY: 200, CD: 20 },
    M: { AB: 0, XY: 500, CD: 20 },
    2XL: { AB: 5, XY: 0, CD: 9 }
  },
  Blue: {
    ...
  }
}
Here's a kind of customer query I might receive:
Show me all products that have the Red.S variant in stock (minimum 100) at warehouses AB & XY.
So this would probably be a filter like
Red.S.AB > 100 AND Red.S.XY > 100
I'm not writing the whole filter query here, but it's straightforward in Elasticsearch.
We might also get SUM queries, e.g. the sum of inventories at AB & XY should be > 500.
That'd be easy through a script filter, say Red.S.AB + Red.S.XY > 500
The problem is, given 100 warehouses, 100 sizes, and 25 colors, this easily needs 100 * 100 * 25 = 250k mappings. Elasticsearch simply can't handle that many keys.
The easy answer is to use nested documents, but nested documents pose a particular problem. We cannot sum across a given selection of nested documents, and nested docs are slow, especially when we're going to have 250k per product.
I'm open to external solutions other than Elasticsearch as well. We're a Rails/Postgres stack.
You have your product index with variants, that's fine, but I'd use another index for managing anything related to the multi-warehouse stock. One document per product/size/color/warehouse with the related count. For instance:
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "AB",
"quantity": 100
}
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "XY",
"quantity": 200
}
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "CD",
"quantity": 20
}
etc...
That way, you'll be much more flexible with your stock queries, because all you'll need is to filter on the fields (product, color, size, warehouse) and simply aggregate on the quantity field, sums, averages or whatever you might think of.
You will probably need to leverage the bucket_script pipeline aggregation in order to decide whether sums are above or below a desired threshold.
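For example, a sketch of such a query; the index name stock, the 500 threshold, and the use of the closely related bucket_selector aggregation (to actually drop buckets below the threshold) are assumptions for illustration:
GET stock/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "Red" } },
        { "term": { "size": "S" } },
        { "terms": { "warehouse": [ "AB", "XY" ] } }
      ]
    }
  },
  "aggs": {
    "by_product": {
      "terms": { "field": "product" },
      "aggs": {
        "total_qty": { "sum": { "field": "quantity" } },
        "enough_stock": {
          "bucket_selector": {
            "buckets_path": { "total": "total_qty" },
            "script": "params.total > 500"
          }
        }
      }
    }
  }
}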
It's also much easier to maintain the stock movements by simply indexing the new quantity for any given combination than having to update the master product document every time an item goes out of stock.
No script, no nested documents required.
The best possible solution would be to create separate indexes for the warehouses, where each warehouse index holds one document per product/size/color/warehouse with the related values, like this:
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "AB",
"quantity": 100
}
This will reduce your mappings to 100 * 25 = 2500 per index.
For the rest of the operations, I feel @Val has covered them in his answer, which is quite impressive and beautiful.
Coming to external solutions, I would say you want to carry out the tasks of storing data, searching it, and fetching it. Elasticsearch and Apache Solr are the best search engines for these kinds of tasks. I have not tried Apache Solr, but I would highly recommend going with Elasticsearch because of its features, active community support, and fast searching. Searching can also be made faster using analyzers and tokenizers. It also has features like full-text search and term-level search to customize searching according to the situation or problem statement.

Randomly selecting document ArangoDB often gives the same results

The other day I saw a method for querying for a random document from a collection using AQL on this very same website:
Randomly select a document in ArangoDB
My implementation of this at the moment is:
//brands
let b1 = (
    for brand in brands
        filter brand.brand == @brand1
        return brand._id
)
//pick random car with brand 1
let c1 = (
    for edge in edges
        filter edge._from == b1[0]
        for car in cars
            filter car._id == edge._to
            sort rand() limit 1
            return car._id
)
However, when I use that method it can hardly be called "random". For instance, in a 3500+ document collection I managed to get the same document 5 times in a row, and over the course of 25+ attempts there were maybe 3 to 4 documents that kept being returned to me. It seems the method is geared towards outputting particular documents. I was wondering if there is still some improvement to be made here, or another method that wasn't mentioned in that thread. The problem is that I can't comment on the thread yet due to low reputation, so I can't ask the question in the same place, but I think it merits a discussion nonetheless. I hope someone can help me get a better randomization.
Essentially the rand() function is being seeded the same on each query execution. Multiple calls within the same query will be different, but the next execution will start back from the same number.
I ran this query and saw the same 3 numbers each time:
return {
"1": rand(),
"2": rand(),
"3": rand()
}
Not always, but more often than not I got the same numbers:
[
{
"1": 0.5635853144932401,
"2": 0.19330423902096622,
"3": 0.8087405011139256
}
]
Then, seeded with current milliseconds:
return {
"1": rand() + DATE_MILLISECOND(DATE_NOW()),
"2": rand() + DATE_MILLISECOND(DATE_NOW()),
"3": rand() + DATE_MILLISECOND(DATE_NOW())
}
Now I always get a different number.
[
{
"1": 617.8103840407173,
"2": 617.0999366056549,
"3": 617.6308832757169
}
]
You can use techniques like this to produce pseudorandom numbers that won't repeat the way calling rand() with the same seed does.
Edit: this is actually a Windows bug. If you can use Linux you should be fine.
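If you are stuck on a platform where the seed repeats, one way to fold that milliseconds trick into the original car-picking query might be the following sketch; cars_for_brand, the scaling, and the use of NTH() are illustrative assumptions rather than a tested recipe:
//reusing b1 from the query in the question
let cars_for_brand = (
    for edge in edges
        filter edge._from == b1[0]
        return edge._to
)
//mix the current milliseconds into the pick so it varies between executions
let idx = floor(((rand() * 1000 + DATE_MILLISECOND(DATE_NOW())) % 1000) / 1000 * length(cars_for_brand))
return nth(cars_for_brand, idx)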

Do you get the same performance using index prefixes?

Say I have a collection containing documents like the one below:
{
_id: ObjectId(),
myValue: 123,
otherValue: 456
}
I then create an index like below:
{myValue: 1, otherValue: 1}
If I execute the following query:
db.myCollection.find({myValue: 123})
will I get the same performance with my index as I would if I had an index on only the myValue field? Or is the performance degraded somehow since it is using an index prefix?
A "compound index" which is the correct term for your "link" does not create any performance problems on "read" ( since writing new entries is obviously more information ) than an index just on the single field used in the query. With one exception.
If you use a "multi-Key" index which means an "array" item as part of the index then you effectively create n more items in the index per key. As in:
{ "a": 1, "b": [ 1, 2, 3 ] }
An index on { "a": 1, "b": 1 } means this in basic terms:
{ "a": 1, "b": 1 },
{ "a": 1, "b": 2 },
{ "a": 1, "b": 3 }
So basically one index entry per array element to be scanned.
But otherwise the extra field does not affect read performance, with the general exception of the "obvious" need to load a structure that contains more data than you "need to use" into memory.
So if you don't need it, then don't use it. And creating "two" indexes ( one compound, one on the single field ) might save you some memory, but it will "cost" you in write performance and storage space in general.
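As a quick way to verify the prefix actually gets used, you can create the compound index from the question and check the query plan (a sketch in the shell; only the collection and field names from the question are used):
// compound index as described in the question
db.myCollection.ensureIndex({ myValue: 1, otherValue: 1 })
// this query only conditions on the prefix { myValue: 1 },
// so explain() should show the compound index being selected
db.myCollection.find({ myValue: 123 }).explain()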
