I have a collection of user documents, where each user can have an arbitrary set of properties. Each user is associated with an app document. Here is an example user:
{
"appId": "XXXXXXX",
"properties": [
{ "name": "age", "value": 30 },
{ "name": "gender", "value": "female" },
{ "name": "alive", "value": true }
]
}
I would like to be able to find/count users based on the values of their properties. For example, find me all users for app X that have property Y > 10 and Z equals true.
I have a compound multikey index on this collection: db.users.ensureIndex({ "appId": 1, "properties.name": 1, "properties.value": 1 }). This index works well for single-condition queries, e.g.:
db.users.find({
appId: 'XXXXXX',
properties: {
$elemMatch: {
name: 'age',
value: {
$gt: 10
}
}
}
})
The above query completes in < 300ms with a collection of 1M users. However, when I try to add a second condition, performance degrades considerably (7-8s), and the explain() output indicates that the whole index is being scanned to fulfill the query ("nscanned" : 2752228).
Query
db.users.find({
appId: 'XXXXXX',
properties: {
$all: [
{
$elemMatch: {
name: 'age',
value: {
$gt: 10
}
}
},
{
$elemMatch: {
name: 'alive',
value: true
}
}
]
}
})
Explain
{
"cursor" : "BtreeCursor appId_1_properties.name_1_properties.value_1",
"isMultiKey" : true,
"n" : 256,
"nscannedObjects" : 1000000,
"nscanned" : 2752228,
"nscannedObjectsAllPlans" : 1018802,
"nscannedAllPlans" : 2771030,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 21648,
"nChunkSkips" : 0,
"millis" : 7425,
"indexBounds" : {
"appId" : [
[
"XXXXX",
"XXXXX"
]
],
"properties.name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"properties.value" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"filterSet" : false
}
I assume this is because Mongo is unable to create suitable bounds since I am looking for both boolean and integer values.
My question is this: Is there a better way to structure my data, or modify my query, to improve performance and take better advantage of my index? Is it possible to instruct Mongo to treat each condition separately, generate appropriate bounds, and then perform the intersection of the results, instead of scanning all documents? Or is Mongo just not suited for this type of use case?
I know this is an old question, but I think it would be much better to structure your data without the "name" and "value" tags:
{
"appId": "XXXXXXX",
"properties": [
{ "age": 30 },
{ "gender: "female" },
{ "alive": true }
]
}
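With this structure each property becomes addressable by name with a plain dotted path, so a multi-condition query no longer needs one $elemMatch per condition. A minimal sketch of how the index and query might look, assuming MongoDB 4.2+ for the wildcard index (compound wildcard indexes that also include appId additionally require 7.0+):
// One wildcard index covers every dynamic property key (MongoDB 4.2+)
db.users.createIndex({ "properties.$**": 1 })
db.users.find({
    appId: 'XXXXXX',
    "properties.age": { $gt: 10 },
    "properties.alive": true
})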
I need to add the attribute named 'metadata' to the JSON flow content.
The 'metadata' attribute looks like this:
{"startTime":1451952013663, "endTime":1453680013663, "name":"Npos19", "deleted":false}
The input is like this:
{
"id": 154299613718447,
"values": [
{
"timestamp": 1451977869683,
"value": 13.1
},
{
"timestamp": 1453949805784,
"value": 7.54
}
]
}
My desired output is:
{
"id": 154299613718447,
"values": [ {
"startTime":1451952013663,
"endTime":1453680013663,
"name":"Npos19",
"deleted":false,
"timestamp": 1451977869683,
"value": 13.1
},
{
"startTime":1451952013663,
"endTime":1453680013663,
"name":"Npos19",
"deleted":false,
"timestamp": 1453949805784,
"value": 7.54
}
]
}
I tried to use a Jolt transformation:
{
"operation": "default",
"spec": {
// extract metadata array from json attribute and put it in a temporary array
"tempArray": "${metadata:jsonPath('$.*')}"
}
}
but it does not work. I need to extract the metadata with $.* because I do not know which keys will be present.
Is there a fast alternative, using other NiFi processors, to merge the attribute into the flow content?
Thanks in advance!
It's possible with a combination of two processors: EvaluateJsonPath -> ScriptedTransformRecord.
EvaluateJsonPath
Destination: flowfile-attribute
Return Type: json
values (dynamic property): $.values
ScriptedTransformRecord
Record Reader: JsonTreeReader
Record Writer: JsonRecordSetWriter
Script Language: Groovy
Script Body:
// Parse the flowfile's 'metadata' attribute and the 'values' attribute extracted by EvaluateJsonPath
def mapMetadata = new groovy.json.JsonSlurper().parseText(attributes['metadata'])
def mapValue = new groovy.json.JsonSlurper().parseText(attributes['values'])
// Copy every metadata key/value pair into each element of the values array
def values = mapValue.each { value ->
    mapMetadata.each { k, v ->
        value."${k}" = v
    }
}
// Drop the original array and write the merged one to a new field
record.setValue('values', null)
record.setValue('updateValues', values)
record
Output JSON
[ {
"id" : 154299613718447,
"values" : null,
"updateValues" : [ {
"timestamp" : 1451977869683,
"value" : 13.1,
"startTime" : 1451952013663,
"endTime" : 1453680013663,
"name" : "Npos19",
"deleted" : false
}, {
"timestamp" : 1453949805784,
"value" : 7.54,
"startTime" : 1451952013663,
"endTime" : 1453680013663,
"name" : "Npos19",
"deleted" : false
} ]
} ]
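If you prefer to avoid the record reader/writer setup, roughly the same merge can be done with a single ExecuteScript processor instead. A sketch in Groovy, assuming the 'metadata' attribute is already present on the flowfile:
import groovy.json.JsonSlurper
import groovy.json.JsonOutput
import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (!flowFile) return
// Parse the 'metadata' attribute once
def metadata = new JsonSlurper().parseText(flowFile.getAttribute('metadata'))
flowFile = session.write(flowFile, { inputStream, outputStream ->
    def json = new JsonSlurper().parse(inputStream)
    // Merge every metadata key into each element of the values array
    json.values.each { v -> metadata.each { k, val -> v[k] = val } }
    outputStream.write(JsonOutput.toJson(json).getBytes(StandardCharsets.UTF_8))
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)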
I have a mongo collection with survey answers submitted by each user. I would like to get the count of users who selected a given option. Only one user has selected option O12, so the output should be 1.
{
"_id" : ObjectId("5ea179eb39ff117948f19266"),
"_class" : "model.survey.Answer",
"survey_id" : "5ea178c239ff117948f19265",
"survey_user" : [
{
"user_id" : 1072,
"user_option" : [
{
"question_id" : "Q1",
"option_id" : "O11"
},
{
"question_id" : "Q2",
"option_id" : "O21"
},
{
"question_id" : "Q3",
"option_id" : "O31"
},
{
"question_id" : "Q4",
"option_id" : "O41"
}
]
},
{
"user_id" : 1073,
"user_option" : [
{
"question_id" : "Q1",
"option_id" : "O12"
},
{
"question_id" : "Q2",
"option_id" : "O21"
},
{
"question_id" : "Q3",
"option_id" : "O31"
},
{
"question_id" : "Q4",
"option_id" : "O41"
}
]
}
]
}
You can do that using MongoDB's aggregation pipeline.
There are different ways to do it; one way is to use $unwind:
Type 1 - Query 1 :
db.collection.aggregate([
/** Optional but will be good on huge collections to lessen data for further stages */
{
$match: { "survey_user.user_option.option_id": "O12" }
},
{
$unwind: "$survey_user"
},
/** After the `unwind` stage, each object/element in the array gets its own document */
{
$match: { "survey_user.user_option.option_id": "O12" }
},
/** After match you'll only have objects which met the criteria in `survey_user` array */
/** group on `_id` & push entire original doc to data field */
{
$group: { _id: "$_id", survey_user: { $push: "$survey_user" }, data: { $first: "$$ROOT" } }
},
/** Add `survey_user` array to `data.survey_user` & it's size to `data.optedCount` field */
{
$addFields: { "data.survey_user": "$survey_user", "data.optedCount": { $size: "$survey_user" } }
},
/** Make `data` as new root to doc */
{
$replaceRoot: { newRoot: "$data" }
}
])
Test : mongoplayground
Just in case you need only the count and don't need the entire doc returned, there is a minor change to the above query:
Type 1 - Query 2 :
db.collection.aggregate([
{
$match: { "survey_user.user_option.option_id": "O12" }
},
{
$unwind: "$survey_user"
},
{
$match: { "survey_user.user_option.option_id": "O12" }
},
/** Just group on `_id` & count no.of docs, maintain `survey_id` */
{
$group: { _id: "$_id", optedCount: { $sum: 1 }, survey_id: { $first: "$survey_id" } }
}
])
Test : mongoplayground
Using the array iterator $reduce might be helpful if your collection's data is huge, as $unwind will explode your docs.
Type 2 - Query :
db.collection.aggregate([
{
$match: {
"survey_user.user_option.option_id": "O12",
},
},
/** Instead of `$addFields`, you can use `$project` to project only the needed fields (which can help improve query performance) */
{
$addFields: {
optedCount: {
$reduce: {
input: "$survey_user",
initialValue: 0,
in: {
$cond: [
{ $in: ["O12", "$$this.user_option.option_id"] },
{ $add: ["$$value", 1] },
"$$value",
]
}
}
}
}
}
]);
Test : mongoplayground
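The same per-document count can also be written a bit more compactly with $filter plus $size, which keeps only the matching users and then counts them; a sketch equivalent to the $reduce version above:
db.collection.aggregate([
  {
    $match: { "survey_user.user_option.option_id": "O12" }
  },
  {
    $addFields: {
      /** Keep only the users whose options contain "O12", then count them */
      optedCount: {
        $size: {
          $filter: {
            input: "$survey_user",
            cond: { $in: ["O12", "$$this.user_option.option_id"] }
          }
        }
      }
    }
  }
])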
I need help building a dynamic MongoDB query.
Everything inside the "$or" array is dynamic:
db.group.find({
"version" : NumberLong(0),
"$or" : [{
"$and" : [
{
"object_type" : "D"
},
{
"type" : "R"
},
{
"name" : "1"
}
]
},{
"$and" : [
{
"object_type" : "D"
},
{
"type" : "E"
},
{
"name" : "2"
}
]
}
]
});
I tried the Spring Data query below, but it doesn't work:
Criteria criteria = Criteria.where("version").is("123");
List<Criteria> docCriterias = new ArrayList<Criteria>();
groups.stream().forEach(grp -> {
docCriterias.add(Criteria.where("type").is(grp.get("type").toString())
.andOperator(Criteria.where("object_type").is(grp.get("objectType").toString()))
.andOperator(Criteria.where("name").is(grp.get("name").toString())));
});
criteria.orOperator((Criteria[]) docCriterias.toArray());
Query q = new Query(criteria);
Thanks for the help
You should pay attention to how you combine the operators.
The following code should work for you (note this is Groovy; remember to change the closure into a Java lambda expression):
List<Criteria> docCriterias = new ArrayList<Criteria>();
List groups = [
[
type: "type1",
object_type: "object_type1",
name: "name1"
],
[
type: "type2",
object_type: "object_type2",
name: "name2"
],
[
type: "type3",
object_type: "object_type3",
name: "name3"
],
]
groups.stream().each {grp ->
docCriterias.add(new Criteria().andOperator(
Criteria.where("type").is(grp.get("type")),
Criteria.where("object_type").is(grp.get("object_type")),
Criteria.where("name").is(grp.get("name"))
))
};
Criteria criteria = new Criteria().andOperator(
Criteria.where("version").is("123"),
new Criteria().orOperator(docCriterias.toArray(new Criteria[docCriterias.size()]))
);
Query q = new Query(criteria);
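For reference, the same thing translated to Java might look like this (a sketch, assuming groups is a List<Map<String, String>> holding the dynamic values, as in the question):
List<Criteria> docCriterias = new ArrayList<>();
groups.forEach(grp -> docCriterias.add(new Criteria().andOperator(
    Criteria.where("type").is(grp.get("type")),
    Criteria.where("object_type").is(grp.get("object_type")),
    Criteria.where("name").is(grp.get("name"))
)));
Criteria criteria = new Criteria().andOperator(
    Criteria.where("version").is("123"),
    // A typed array avoids the ClassCastException the original (Criteria[]) cast produced
    new Criteria().orOperator(docCriterias.toArray(new Criteria[0]))
);
Query q = new Query(criteria);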
Which will give you this query:
{
"$and":[
{
"version":"123"
},
{
"$or":[
{
"$and":[
{
"type":"type1"
},
{
"object_type":"object_type1"
},
{
"name":"name1"
}
]
},
{
"$and":[
{
"type":"type2"
},
{
"object_type":"object_type2"
},
{
"name":"name2"
}
]
},
{
"$and":[
{
"type":"type3"
},
{
"object_type":"object_type3"
},
{
"name":"name3"
}
]
}
]
}
]
},
Fields:{
},
Sort:{
}
You could achieve this using a MongoDB aggregation pipeline in JSON, with Apache Velocity to customize the query further, then execute it through db.runCommand using Spring's MongoTemplate.
Example:
monodb_client_dynamic_query.vm
{
"aggregate": "client",
"pipeline": [
{
"$match" : {
"$and" : [
{
"is_removed" : {
"$ne" : [
true
]
}
},
{
"errors" : {
"$size" : 0.0
}
},
{
"client_id": "$velocityMap.client_id"
}
]
}
},
{
"$project" : {
"_id" : -1.0,
"account" : "$_id.account",
"person_id" : "$_id.person_id",
"begin_date": { $dateToString: { format: "%Y-%m-%d", date: "$value.begin_date" } },
"end_date": { $dateToString: { format: "%Y-%m-%d", date: "$value.end_date" } }
}
}
]
}
Then execute it using MongoTemplate:
// Load the monodb_client_dynamic_query.vm script from a file
String script = ...
// Put all variables to replace in the MongoDB script
Map parameters = ...
String scriptNoSql = VelocityUtil.loadTemplateVM(script, parameters);
DBObject dbObject = (BasicDBObject) JSON.parse(scriptNoSql);
if (null == dbObject) {
return;
}
DB db = mongoTemplate.getDb();
CommandResult result = db.command(dbObject);
if(!result.ok()) {
throw result.getException();
}
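Note that VelocityUtil.loadTemplateVM above is this answer's own helper, not part of Velocity; a minimal sketch of what it might do:
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.Velocity;
import java.io.StringWriter;
import java.util.Map;

public class VelocityUtil {
    // Render a template string, substituting $variables from the parameters map
    public static String loadTemplateVM(String template, Map<String, Object> parameters) {
        VelocityContext context = new VelocityContext(parameters);
        StringWriter writer = new StringWriter();
        Velocity.evaluate(context, writer, "monodb_client_dynamic_query", template);
        return writer.toString();
    }
}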
I had a collection like this, but with much more data.
{
_id: ObjectId("db759d014f70743495ef1000"),
tracked_item_origin: "winword",
tracked_item_type: "Software",
machine_user: "mmm.mmm",
organization_id: ObjectId("a91864df4f7074b33b020000"),
group_id: ObjectId("20ea74df4f7074b33b520000"),
tracked_item_id: ObjectId("1a050df94f70748419140000"),
tracked_item_name: "Word",
duration: 9540,
}
{
_id: ObjectId("2b769d014f70743495fa1000"),
tracked_item_origin: "http://www.facebook.com",
tracked_item_type: "Site",
machine_user: "gabriel.mello",
organization_id: ObjectId("a91864df4f7074b33b020000"),
group_id: ObjectId("3f6a64df4f7074b33b040000"),
tracked_item_id: ObjectId("6f3466df4f7074b33b080000"),
tracked_item_name: "Facebook",
duration: 7920,
}
I do an aggregation to return grouped data like this:
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"Twitter"}, "duration"=>288540},
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"ANoticia"}, "duration"=>237300},
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"Facebook"}, "duration"=>203460},
{"_id"=>{"tracked_item_type"=>"Software", "tracked_item_name"=>"Word"}, "duration"=>269760},
{"_id"=>{"tracked_item_type"=>"Software", "tracked_item_name"=>"Excel"}, "duration"=>204240}
Simple aggregation code:
AgentCollector.collection.aggregate(
{'$match' => {group_id: '20ea74df4f7074b33b520000'}},
{'$group' => {
_id: {tracked_item_type: '$tracked_item_type', tracked_item_name: '$tracked_item_name'},
duration: {'$sum' => '$duration'}
}},
{'$sort' => {
'_id.tracked_item_type' => 1,
duration: -1
}}
)
Is there a way to limit to only 2 items per tracked_item_type key? E.g. 2 Sites and 2 Software entries.
As your question currently stands unclear, I really hope you mean that you want to specify two Site keys and two Software keys, because that's a nice and simple answer that you can just add to your $match phase, as in:
{$match: {
group_id: "20ea74df4f7074b33b520000",
tracked_item_name: {$in: ['Twitter', 'Facebook', 'Word', 'Excel' ] }
}},
And we can all cheer and be happy ;)
If, however, your question is something more diabolical, such as getting the top 2 Site and Software entries from the result by duration, then we thank you very much for spawning this abomination.
Warning:
Your mileage may vary on what you actually want to do or whether this is going to blow up by the sheer size of your results. But this follows as an example of what you are in for:
db.collection.aggregate([
// Match items first to reduce the set
{$match: {group_id: "20ea74df4f7074b33b520000" }},
// Group on the types and "sum" of duration
{$group: {
_id: {
tracked_item_type: "$tracked_item_type",
tracked_item_name: "$tracked_item_name"
},
duration: {$sum: "$duration"}
}},
// Sort by type and duration descending
{$sort: { "_id.tracked_item_type": 1, duration: -1 }},
/* The fun part */
// Re-shape results to "sites" and "software" arrays
{$group: {
_id: null,
sites: {$push:
{$cond: [
{$eq: ["$_id.tracked_item_type", "Site" ]},
{ _id: "$_id", duration: "$duration" },
null
]}
},
software: {$push:
{$cond: [
{$eq: ["$_id.tracked_item_type", "Software" ]},
{ _id: "$_id", duration: "$duration" },
null
]}
}
}},
// Remove the null values for "software"
{$unwind: "$software"},
{$match: { software: {$ne: null} }},
{$group: {
_id: "$_id",
software: {$push: "$software"},
sites: {$first: "$sites"}
}},
// Remove the null values for "sites"
{$unwind: "$sites"},
{$match: { sites: {$ne: null} }},
{$group: {
_id: "$_id",
software: {$first: "$software"},
sites: {$push: "$sites"}
}},
// Project out software and limit to the *top* 2 results
{$unwind: "$software"},
{$project: {
    _id: { _id: "$software._id", duration: "$software.duration" },
    sites: "$sites"
}},
{$limit : 2},
// Project sites, grouping multiple software per key, requires a sort
// then limit the *top* 2 results
{$unwind: "$sites"},
{$group: {
_id: { _id: "$sites._id", duration: "$sites.duration" },
software: {$push: "$_id" }
}},
{$sort: { "_id.duration": -1 }},
{$limit: 2}
])
Now what that results in is not exactly the clean set of results that would be ideal, but it is something that can be programmatically worked with, and better than filtering the previous results in a loop. (My data from testing:)
{
"result" : [
{
"_id" : {
"_id" : {
"tracked_item_type" : "Site",
"tracked_item_name" : "Digital Blasphemy"
},
"duration" : 8000
},
"software" : [
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Word"
},
"duration" : 9540
},
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Notepad"
},
"duration" : 4000
}
]
},
{
"_id" : {
"_id" : {
"tracked_item_type" : "Site",
"tracked_item_name" : "Facebook"
},
"duration" : 7920
},
"software" : [
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Word"
},
"duration" : 9540
},
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Notepad"
},
"duration" : 4000
}
]
}
],
"ok" : 1
}
So you see you get the top 2 Sites in the array, with the top 2 Software items embedded in each. Aggregation itself cannot clean this up any further, because we would need to re-merge the items we split apart in order to do this, and as yet there is no operator that we could use to perform this action.
But that was fun. It's not all the way done, but most of the way, and making that into a 4 document response would be relatively trivial code. But my head hurts already.
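For anyone reading this later: MongoDB 3.2 added $slice as an aggregation expression, which makes "top 2 per type" far less painful than the above. A sketch, not tested against the original data:
db.collection.aggregate([
    {$match: { group_id: "20ea74df4f7074b33b520000" }},
    {$group: {
        _id: {
            tracked_item_type: "$tracked_item_type",
            tracked_item_name: "$tracked_item_name"
        },
        duration: {$sum: "$duration"}
    }},
    {$sort: { "_id.tracked_item_type": 1, duration: -1 }},
    // Collect the name/duration pairs per type, already sorted by duration
    {$group: {
        _id: "$_id.tracked_item_type",
        items: {$push: { name: "$_id.tracked_item_name", duration: "$duration" }}
    }},
    // Keep only the first two entries per type
    {$project: { items: {$slice: ["$items", 2] }}}
])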
Good morning all. I am currently building a tool which scrapes a number of figures from various logs on several different servers using AUX and shell scripts on a minute-by-minute basis. Currently I have scripts that also produce HTML files every minute, so that I can look back and see the system health for any given minute. So essentially these HTML files are being used as the archiving medium, which obviously is very strange.
Taking this tool forward, I am looking to build a front end on the Dojo platform which will support widgets and graphs for each of the different data sets gathered from the logs. I haven't used Dojo much in the past, and was hoping that somebody more experienced with the technology would be able to tell me what storage medium I should be using for these simple metrics produced by the scripts every minute. I have heard of the JSON method but am not sure.
Thanks for your time!
Jonny
Just to follow up on this in the rare chance it may be stumbled upon by another user: I decided that JSON was definitely the way forward. Each minute I have a system that produces a JSON file like the one below.
{
"DataDTS" : "2014.05.02-09:00",
"System" : "Performance Test",
"SourceServer" : "gb02qws122debx7",
"OldestData" : "2014.05.02-09:00",
"MetricData" : {
"FredHopper Direct" : {
"Small" : "FH",
"Type" : "ByMinute",
"HeaderLevels" : 1,
"Data": { "Failures":0, "Avg":"0.318", "Min":"0.306", "Max":"0.340" }
},
"Performance By Instance" : {
"Small" : "RATE",
"Type" : "ByMinute",
"HeaderLevels" : 2,
"Sections" : {
"102.01" : { "Attributes" : ["(20)"], "Data": { "Req":"999", "Avg Resp":"0.254" } },
"102.02" : { "Attributes" : ["(20)"], "Data": { "Req":"144", "Avg Resp":"0.376" } },
"103.01" : { "Attributes" : ["(20)"], "Data": { "Req":"168", "Avg Resp":"0.199" } },
"103.02" : { "Attributes" : ["(20)"], "Data": { "Req":"152", "Avg Resp":"0.283" } },
"Unallocated" : { "Attributes": [], "Data": {"Req":"87", "Avg Resp":"0.154"} },
"qws122" : { "Attributes": [], "Data": {"Tot Req":"460", "Rate PPS":"7.7"} }
}
},
"HTML Status" : {
"Small" : "HTML",
"Type" : "Status",
"HeaderLevels" : 1,
"Data" : [
{ "Code":"200", "Req":447 },
{ "Code":"206", "Req":1 },
{ "Code":"301", "Req":7 },
{ "Code":"302", "Req":5 },
{ "Code":"TOT", "Req":460 }
]
},
"Page Breakdown" : {
"Small" : "PB",
"Type" : "Status",
"HeaderLevels" : 2,
"Sections" : {
"Top 15 Average Response Times" : {
"Attributes": [],
"Data" : [
{ "URL":"men", "Req":1, "Avg Resp":"3.597", "Cached":"0.0%" },
{ "URL":"jacques-vert", "Req":1, "Avg Resp":"3.335", "Cached":"0.0%" },
{ "URL":"OrderOKView", "Req":1, "Avg Resp":"1.615", "Cached":"0.0%" },
{ "URL":"warehouse", "Req":1, "Avg Resp":"1.050", "Cached":"0.0%" }
]
},
"Top 15 Requests" : {
"Attributes": [],
"Data" : [
{ "URL":"Navigate", "Req":220, "Avg Resp":"0.372", "Cached":"26.4%" },
{ "URL":"prod", "Req":42, "Avg Resp":"0.186", "Cached":"13.5%" },
{ "URL":"sureroute-test-object.html", "Req":24, "Avg Resp":"0.001", "Cached":"0.0%" },
{ "URL":"TopCategories1", "Req":8, "Avg Resp":"0.032", "Cached":"0.0%" }
]
}
}
},
"App server CPU" : {
"Small" : "CPU",
"Type" : "ByMinute",
"HeaderLevels" : 2,
"Sections" : {
"qap302" : { "Attributes" : ["Ent:7.0", "Max:8.0", "Mem:32768MB", "Wgt:200"], "Data": { "usr%":"16.8", "iow%":"1.4", "phy":"1.8", "ent%":"25.1" } },
"qap312" : { "Attributes" : ["Ent:7.0", "Max:8.0", "Mem:32768MB", "Wgt:200"], "Data": { "usr%":"8.1", "iow%":"0.1", "phy":"0.8", "ent%":"11.0" } }
}
}
}
}
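Once the files are exposed over HTTP, pulling one into a Dojo page for the widgets is straightforward with dojo/request/xhr. A minimal sketch (the URL and the chart wiring are hypothetical):
require(["dojo/request/xhr"], function(xhr){
    // handleAs: "json" parses the response body for us
    xhr("/metrics/2014.05.02-0900.json", { handleAs: "json" }).then(function(data){
        // e.g. feed data.MetricData["App server CPU"].Sections into a chart or grid widget
        console.log(data.System, data.DataDTS);
    });
});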