MongoDB multikey index write performance degrading

In MongoDB I have a collection whose documents contain an array of subdocuments, and I would like to index a field of those subdocuments:
{
    _id : ObjectId(),
    members : [
        { ref : ObjectId().str, ... },
        { ref : ObjectId().str, ... },
        ...
    ]
}
The index is on the ref field, so that I can quickly find all documents that have a particular ref among their members:
db.test.ensureIndex({ "members.ref" : 1 });
I noticed that the performance of pushing an additional subdocument to the array degrades quickly once the array length exceeds a few thousand elements. If I instead use an index on an array of strings, the performance does not degrade.
The following code demonstrates the behavior:
var _id = ObjectId("522082310521b655d65eda0f");
function initialize () {
    db.test.drop();
    db.test.insert({ _id : _id, members : [], memberRefs : [] });
}
function pushToArrays (n) {
    var total, err, ref;
    total = Date.now();
    for (var i = 0; i < n; i++) {
        ref = ObjectId().str;
        db.test.update({ _id : _id }, { $push : { members : { ref : ref }, memberRefs : ref } });
        err = db.getLastError();
        if (err) {
            throw err;
        }
        if ((i + 1) % 1000 === 0) {
            print("pushed " + (i + 1));
        }
    }
    total = Date.now() - total;
    print("pushed " + n + " in " + total + "ms");
}
initialize();
pushToArrays(5000);
db.test.ensureIndex({ "members.ref" : 1 });
pushToArrays(10);
db.test.dropIndexes();
db.test.ensureIndex({ "memberRefs" : 1 });
pushToArrays(10);
db.test.dropIndexes();
For example, with MongoDB 2.4.6 on my machine, pushing 10 elements onto arrays of length 5000 takes:
Index on "members.ref": 37272ms
Index on "memberRefs": 405ms
That difference seems unexpected. Is this a problem with MongoDB or my use of the multikey index? Is there a recommended way of handling this? Thanks.

Take a look at SERVER-8192 and SERVER-8193. Hopefully that will help answer your question!
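The timings in the question also suggest one possible workaround: keep a parallel array of plain strings (memberRefs in the test script) next to the subdocument array, and put the multikey index on that array instead of on "members.ref". A minimal sketch, assuming both arrays are always updated together; buildPushUpdate and the role field are hypothetical and not part of the MongoDB API:

```javascript
// Hypothetical helper: builds one update document that pushes the full
// subdocument to `members` and the bare ref string to `memberRefs`,
// so both arrays stay in sync within a single update.
function buildPushUpdate(ref, extraFields) {
    var member = { ref: ref };
    Object.keys(extraFields || {}).forEach(function (key) {
        member[key] = extraFields[key];
    });
    return { $push: { members: member, memberRefs: ref } };
}

// Usage in the mongo shell (assumes the collection from the question;
// `role` is a made-up extra field):
//   db.test.ensureIndex({ memberRefs : 1 });
//   db.test.update({ _id : _id }, buildPushUpdate(ObjectId().str, { role : "admin" }));
//   db.test.find({ memberRefs : someRef });
```

The cost of this layout is that every removal has to touch both arrays as well, otherwise the indexed strings and the subdocuments drift apart.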

Related

Elastic Search - How to return records within time intervals

I have an Elasticsearch cluster deployed within an AWS VPC. It holds millions of records, each with a timestamp taken from the Unix time in milliseconds (new Date().getTime()). I am trying to pull one record per time slot based on min/max hour and minute values.
Index Mapping:
{ timestamp: "date", ...rest of record }
Elastic Search Query:
let params = {
    query: {
        bool: {
            must: [
                {
                    range: {
                        timestamp: {
                            gte: (unix date),
                            lte: (unix date)
                        }
                    }
                },
                {
                    script: {
                        script: {
                            source: "long datestamp = doc['timestamp'].value.getMillis(); " +
                                "Date dt = new java.util.Date(datestamp*1L); " +
                                "Calendar instance = Calendar.getInstance(); " +
                                "instance.setTime(dt); " +
                                "int hod = instance.get(Calendar.HOUR_OF_DAY); " +
                                "int tod = instance.get(Calendar.MINUTE); " +
                                "if (hod >= params.hourMin && hod <= params.hourMax && (hod === params.hourMin && tod >= params.timeMin || hod === params.hourMax && tod <= params.timeMax)) { return true; } else { return false }",
                            params: {
                                hourMin: 7,
                                hourMax: 8,
                                timeMin: 30,
                                timeMax: 10
                            }
                        }
                    }
                }
            ]
        }
    },
    from: 0,
    size: 500
};
Issue:
I often run into the following error while searching:
"dynamic method [java.lang.Long, getMillis/0] not found"
It shows up on roughly every fourth or fifth query.
Question:
Is there a better way? I have pored over the Elasticsearch docs regarding intervals, histograms, etc. and came up with the query above. I am not sure whether it is the most efficient or the most robust method.
If this is a community-accepted approach to finding records within an interval, how do I mitigate the errors I am encountering? Should I skip specific records, or reformat the Unix timestamp some other way?
Thanks in advance for your support.
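Independently of the getMillis error, the window condition in the script only matches documents whose hour equals hourMin or hourMax, so it would miss any full hours in between. One way to express "between 07:30 and 08:10" is to compare minutes since midnight. The sketch below shows that logic in plain JavaScript rather than Painless. It assumes UTC and a window that does not cross midnight, and inDailyWindow is a hypothetical name:

```javascript
// Returns true when the timestamp's time of day (UTC) falls inside
// [hourMin:timeMin, hourMax:timeMax], inclusive on both ends.
function inDailyWindow(epochMillis, params) {
    var d = new Date(epochMillis);
    var minuteOfDay = d.getUTCHours() * 60 + d.getUTCMinutes();
    var start = params.hourMin * 60 + params.timeMin;
    var end = params.hourMax * 60 + params.timeMax;
    return minuteOfDay >= start && minuteOfDay <= end;
}

var params = { hourMin: 7, hourMax: 8, timeMin: 30, timeMax: 10 };
console.log(inDailyWindow(Date.UTC(2020, 0, 1, 7, 45), params)); // true
console.log(inDailyWindow(Date.UTC(2020, 0, 1, 7, 15), params)); // false
console.log(inDailyWindow(Date.UTC(2020, 0, 1, 8, 11), params)); // false
```

The same comparison can be carried over to the script query once the hour and minute have been extracted from the timestamp.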

Problem with jQuery Tablesorter custom-filter having accented characters and special config

I'm facing an issue with custom filters when all of the following apply:
the filter names have accented characters
textExtraction is defined (to use the data-sort-value attribute instead of the node text)
sortLocaleCompare is set to true
Steps to reproduce
In column named '2' (I'm using flaticon in my app), select option "Modéré" or ">= Modéré"
Observed result
The filter doesn't find any result, so the table is empty
Expected result
It should find :
1 row (when using option "Modéré") OR
2 rows (when using option ">= Modéré" as "Sérieux" is greater than "Modéré")
Please find the link with the described situation.
When I make either of the following changes:
set sortLocaleCompare: false
comment out or remove the textExtraction definition
the filter works. Either change on its own is enough.
Of course, neither change is a satisfactory workaround, because:
Option 1: with sortLocaleCompare: false, sorting by the second column "Société" puts the company "Bâloise" AFTER "BVZ Holding" because of the "â".
Option 2: I need the textExtraction function because I set integer values to make the ">= Modéré" logic work, and also to store multiple semicolon-separated integers so that one element can carry several themes (with a custom filter listing each theme once).
I tried to keep the example as short and complete as possible. The table can be generated in three languages (my app is in English, French and German), and the filters are attached by CSS class name so that they can be reused in multiple tables across the application.
Here is the short version of my generic config (multiple tables using it) :
$(function() {
    $(".tablesorter").tablesorter({
        theme: 'blue',
        sortLocaleCompare: true,
        widgets: ["filter"],
        textExtraction: textExtractionDataSortValue,
        filter_onlyAvail: 'filter-onlyAvail',
        widgetOptions: {
            filter_functions: {
                '.filter-controversy': filterControversy,
            }
        }
    });
});
The custom filter function (generated either with english, french or german depending on the user's language):
var filterControversy = {
    'Aucun': function(e, n) {
        console.info(e + " n=" + n);
        return e == '';
    },
    'Modéré': function(e, n) {
        console.info(e + " n=" + n);
        return e == 101;
    },
    ' >=Modéré': function(e, n) {
        console.info(e + " n=" + n);
        return e >= 101;
    },
    'Serieux': function(e, n) {
        console.info(e + " n=" + n);
        return e == 102;
    },
    ' >=Sérieux': function(e, n) {
        return e >= 102;
    },
    'Sévère': function(e, n) {
        return e == 106;
    },
    'Majeur': function(e, n) {
        console.info(e + " n=" + n);
        return e == 103;
    },
    'Tous': function(e, n) {
        return e != '';
    }
}
Thanks for your help
Tablesorter version : 2.31.3 (latest)
You are correct that sortLocaleCompare is causing the problem. When it is enabled, the accents are removed from the filter function name. To work around this, change each function name to include both the unaccented name (used to look up the function) and the accented name (shown to users), separated by "|" (demo).
You should only need to change the filterControversy object as follows:
var filterControversy = {
    'Aucun': function(e, n) {
        return e == '';
    },
    'Modere|Modéré': function(e, n) {
        return e == 101;
    },
    ' >=Modere| >=Modéré': function(e, n) {
        return e >= 101;
    },
    'Serieux': function(e, n) {
        return e == 102;
    },
    ' >=Serieux| >=Sérieux': function(e, n) {
        return e >= 102;
    },
    'Severe|Sévère': function(e, n) {
        return e == 106;
    },
    'Majeur': function(e, n) {
        return e == 103;
    },
    'Tous': function(e, n) {
        return e != '';
    }
};
The | separator can be changed using the filter_selectSourceSeparator widget option.
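Since the filter object is generated per language, the "unaccented|accented" keys could also be generated rather than typed by hand. The sketch below is one way to do that. accentSafeKey and buildFilterFunctions are hypothetical helpers, and stripping accents via Unicode NFD normalization is an assumption: tablesorter applies its own replacement table, so the result should be checked against the labels actually in use.

```javascript
// Hypothetical helper: turns a display label into a filter_functions key.
// Labels without accents are returned unchanged; accented labels become
// "<unaccented>|<accented>".
function accentSafeKey(label) {
    var plain = label.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
    return plain === label ? label : plain + "|" + label;
}

// Hypothetical helper: builds the filter object from label/test pairs.
function buildFilterFunctions(entries) {
    var result = {};
    entries.forEach(function (entry) {
        result[accentSafeKey(entry.label)] = entry.test;
    });
    return result;
}

var filterControversy = buildFilterFunctions([
    { label: "Modéré", test: function (e, n) { return e == 101; } },
    { label: " >=Modéré", test: function (e, n) { return e >= 101; } },
    { label: "Majeur", test: function (e, n) { return e == 103; } }
]);
// Resulting keys: 'Modere|Modéré', ' >=Modere| >=Modéré', 'Majeur'
console.log(Object.keys(filterControversy));
```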

Elasticsearch scripted_metric null_pointer_exception

I'm trying to use the scripted_metric aggregation in Elasticsearch. Normally it works perfectly fine with my other scripts.
However, with the script below I get a "null_pointer_exception", even though it is copy-pasted from scripts that already work for six other modules.
$max = 10;
{
    "query": {
        "match_all": {}
        //omitted some queries here, so I just turned it into match_all
    },
    "aggs": {
        "ARTICLE_CNT_PDAY": {
            "histogram": {
                "field": "pub_date",
                "interval": "86400"
            },
            "aggs": {
                "LATEST": {
                    "nested": {
                        "path": "latest"
                    },
                    "aggs": {
                        "SUM_SVALUE": {
                            "scripted_metric": {
                                "init_script": "
                                    state.te = [];
                                    state.g = 0;
                                    state.d = 0;
                                    state.a = 0;
                                ",
                                "map_script": "
                                    if(state.d != doc['_id'].value){
                                        state.d = doc['_id'].value;
                                        state.te.add(state.a);
                                        state.g = 0;
                                        state.a = 0;
                                    }
                                    state.a = doc['latest.soc_mm_score'].value;
                                ",
                                "combine_script": "
                                    state.te.add(state.a);
                                    double count = 0;
                                    for (t in state.te) {
                                        count += ((t*10)/$max)
                                    }
                                    return count;
                                ",
                                "reduce_script": "
                                    double count = 0;
                                    for (a in states) {
                                        count += a;
                                    }
                                    return count;
                                "
                            }
                        }
                    }
                }
            }
        }
    }
}
I tried running this script in Kibana, and here's the error message:
As far as I can tell, something is wrong in the reduce_script portion. I tried changing this part:
FROM
for (a in states) {
    count += a;
}
TO
for (a in states) {
    count += 1;
}
With that change it works perfectly fine, so I suspect that the variable a does not hold what it is supposed to hold.
Any ideas here? Would appreciate your help, thank you very much!
The reason is explained in the Elasticsearch documentation for the scripted metric aggregation:
If a parent bucket of the scripted metric aggregation does not collect any documents an empty aggregation response will be returned from the shard with a null value. In this case the reduce_script's states variable will contain null as a response from that shard. reduce_script's should therefore expect and deal with null responses from shards.
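To make the quoted behavior concrete, here is a plain-JavaScript model of a reduce step that receives one null among the per-shard states. It is only an illustration of the null handling, not Painless, and reduceStates is a hypothetical name:

```javascript
// Plain-JavaScript model of the reduce step: `states` holds one entry per
// shard, and a shard whose bucket collected no documents contributes null.
function reduceStates(states) {
    var count = 0;
    states.forEach(function (a) {
        // Treat a missing shard result as 0 instead of using it directly.
        count += (a == null ? 0 : a);
    });
    return count;
}

console.log(reduceStates([12.5, null, 7.5])); // 20
```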
So at least one of your buckets is probably empty, and you need to handle that null, for example like this:
"reduce_script": "
    double count = 0;
    for (a in states) {
        count += (a ?: 0);
    }
    return count;
"

display calculated data with CouchDB and PouchDB

I'm trying to understand how to return calculated data on docs using CouchDB and PouchDB.
Say I have two types of docs on my CouchDB: Blocks and Reports.
Reports consist of: report_id, block_id and date.
Blocks consist of: block_id and name.
I'd like to calculate, for each block, its last report_id (the id of the most recent report), and return it with the block's doc.
Is there a way to achieve that?
I'm assuming that a View of some type will do the trick but I can't figure it out.
You can do this with map/reduce functions in CouchDB.
Let's say you have these documents:
{
    "_id": "report_1",
    "type": "report",
    "block_id": "block_1",
    "date": "1500325245"
}
{
    "_id": "report_2",
    "type": "report",
    "block_id": "block_1",
    "date": "1153170045"
}
You would like to get the report with the highest timestamp (in this case, report_1).
We start by creating a map function that emits the block_id as the key, and the timestamp plus the report id as the value for the reduce function.
Map:
function (doc) {
    // Only reports are indexed; the sample documents store the timestamp in "date".
    if (doc.type == "report")
        emit(doc.block_id, {date: doc.date, report: doc._id});
}
Then we create a reduce function. CouchDB calls it with the emitted values (rereduce is false) and may call it again with the results of earlier reduce calls (rereduce is true). Returning the {date, report} entry with the highest timestamp in both cases keeps the two passes consistent; the report field of the result is the id of the most recent report. Query the view with group=true to get one result per block_id.
Reduce function:
function (keys, values, rereduce) {
    // In both passes, values holds {date, report} objects,
    // so the same scan works for reduce and rereduce.
    var latest = values[0];
    for (var i = 1; i < values.length; i++) {
        if (parseInt(values[i].date, 10) > parseInt(latest.date, 10)) {
            latest = values[i];
        }
    }
    // latest.report is the id of the most recent report.
    return latest;
}
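A quick way to sanity-check this kind of view without a CouchDB instance is to simulate map and reduce in plain JavaScript. The sketch below is a simplified model and not CouchDB's actual view engine: mapDoc, reduceLatest and latestReportPerBlock are hypothetical names, and the reduce here returns the whole {date, report} entry so that the same logic is valid for both reduce and rereduce.

```javascript
// Sample report documents, as in the answer.
var docs = [
    { _id: "report_1", type: "report", block_id: "block_1", date: "1500325245" },
    { _id: "report_2", type: "report", block_id: "block_1", date: "1153170045" }
];

// Map step: hand [key, value] pairs to emit, as a view map function would.
function mapDoc(doc, emit) {
    if (doc.type === "report") {
        emit(doc.block_id, { date: doc.date, report: doc._id });
    }
}

// Reduce step: keep the entry with the largest timestamp. The result has
// the same shape as the inputs, so re-reducing partial results also works.
function reduceLatest(keys, values, rereduce) {
    return values.reduce(function (best, v) {
        return parseInt(v.date, 10) > parseInt(best.date, 10) ? v : best;
    });
}

// Hypothetical stand-in for querying the view with group=true.
function latestReportPerBlock(allDocs) {
    var groups = {};
    allDocs.forEach(function (doc) {
        mapDoc(doc, function (key, value) {
            (groups[key] = groups[key] || []).push(value);
        });
    });
    var result = {};
    Object.keys(groups).forEach(function (key) {
        result[key] = reduceLatest(null, groups[key], false).report;
    });
    return result;
}

console.log(latestReportPerBlock(docs)); // { block_1: 'report_1' }
```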

Script to return array for scripted metric aggregation from combine

For the scripted metric aggregation, in the example shown in the documentation, the combine script returns a single number.
Can I return an array or a hash instead?
I tried doing so; it did not raise any error, but I am not able to access those values from the reduce script.
In the reduce script I get, per shard, an instance that reads 'Script2$_run_closure1#52ef3bd9' when converted to a string.
Kindly let me know if this can be accomplished in any way.
You can, at least in Elasticsearch version 1.5.1.
For example, we can modify the Elasticsearch example (scripted metric aggregation) to compute an average profit (profit divided by the number of transactions):
{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "avg_profit": {
            "scripted_metric": {
                "init_script": "_agg['transactions'] = []",
                "map_script": "if (doc['type'].value == \"sale\") { _agg.transactions.add(doc['amount'].value) } else { _agg.transactions.add(-1 * doc['amount'].value) }",
                "combine_script": "profit = 0; num_of_transactions = 0; for (t in _agg.transactions) { profit += t; num_of_transactions += 1 }; return [profit, num_of_transactions]",
                "reduce_script": "profit = 0; num_of_transactions = 0; for (a in _aggs) { profit += a[0] as int; num_of_transactions += a[1] as int }; return profit / num_of_transactions as float"
            }
        }
    }
}
NOTE: this is just a demo of returning an array from the combine script; you can easily calculate the average without using any arrays.
The response will look like:
"aggregations" : {
    "avg_profit" : {
        "value" : 42.5
    }
}
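The data flow between the two scripts can be modelled in plain JavaScript: each shard's combine step returns a two-element array, and the reduce step reads both elements from every shard. This is only a sketch of the idea, not Elasticsearch code; the function names and the transaction amounts are made up for illustration.

```javascript
// Combine step for one shard: returns [profit, numberOfTransactions],
// i.e. an array rather than a single number.
function combine(transactions) {
    var profit = 0;
    var count = 0;
    transactions.forEach(function (t) {
        profit += t;
        count += 1;
    });
    return [profit, count];
}

// Reduce step: receives one array per shard and merges them.
function reduce(shardResults) {
    var profit = 0;
    var count = 0;
    shardResults.forEach(function (a) {
        profit += a[0];
        count += a[1];
    });
    return profit / count;
}

// Two shards with made-up signed amounts (sales positive, costs negative).
var shards = [[100, -20], [50, -45]];
console.log(reduce(shards.map(combine))); // 21.25
```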
