Ruby MongoDB - improving speed when working with multiple collections

I'm using MongoDB from Ruby via the mongo gem.
I have the following scenario:
1. for each document in a collection, say coll1, look at key1 and key2
2. search for a document in another collection, say coll2, with matching values for key1 and key2
3. if there is a match, add to the document fetched in #2 a new key key3 whose value is set to the value of key3 in the document referenced in #1
4. insert the updated hash into a new collection coll3
The general guideline with MongoDB has been to handle cross collection operations in application code.
So I do the following:
client = Mongo::Client.new([ '127.0.0.1:27017' ], :database => some_db,
                           :server_selection_timeout => 5)
cursor = client[:coll1].find({}, { :projection => { :_id => 0 } }) # exclude _id
cursor.each do |doc|
  doc_coll2 = client[:coll2].find('$and' => [ { :key1 => doc[:key1] }, { :key2 => doc[:key2] } ]).limit(1).first # no find_one method
  if doc_coll2 && doc[:key3]
    doc_coll2[:key3] = doc[:key3]
    doc_coll2.delete(:_id) # remove key :_id
    client[:coll3].insert_one(doc_coll2)
  end
end
This works, but it takes a lot of time to finish the job - approximately 250 ms per document in collection coll1, or about 3600 s (1 hour) for ~15000 records. That seems like a lot, and could be associated with reading one document at a time, doing the check in application code, and then writing one document at a time back to a new collection.
Is there a way to get this operation done faster? Is the way I'm doing it even the right way to do it?
Example documents
coll1
{
"_id" : ObjectId("588610ead0ae360cb815e55f"),
"key1" : "115384042",
"key2" : "276209",
"key3" : "10101122317876"
}
coll2
{
"_id" : ObjectId("788610ead0ae360def15e88e"),
"key1" : "115384042",
"key2" : "276209",
"key4" : 10,
"key5" : 4,
"key6" : 0,
"key7" : "false",
"key8" : 0,
"key9" : "false"
}
coll3
{
"_id" : ObjectId("788610ead0ae360def15e88e"),
"key1" : "115384042",
"key2" : "276209",
"key3" : "10101122317876",
"key4" : 10,
"key5" : 4,
"key6" : 0,
"key7" : "false",
"key8" : 0,
"key9" : "false"
}

A solution would be to use the aggregation framework instead and do this in a single query:
perform a join on key1 field with $lookup
unwind the array with $unwind
keep doc where coll1.key2 == coll2.key2 with $redact
reformat the document with $project
write it to coll3 with $out
so the query would be :
db.coll1.aggregate([
  { "$lookup": {
    "from": "coll2",
    "localField": "key1",
    "foreignField": "key1",
    "as": "coll2_doc"
  }},
  { "$unwind": "$coll2_doc" },
  { "$redact": {
    "$cond": [
      { "$eq": [ "$key2", "$coll2_doc.key2" ] },
      "$$KEEP",
      "$$PRUNE"
    ]
  }},
  { "$project": {
    "key1": 1,
    "key2": 1,
    "key3": 1,
    "key4": "$coll2_doc.key4",
    "key5": "$coll2_doc.key5",
    "key6": "$coll2_doc.key6",
    "key7": "$coll2_doc.key7",
    "key8": "$coll2_doc.key8",
    "key9": "$coll2_doc.key9"
  }},
  { "$out": "coll3" }
], { allowDiskUse: true });
and db.coll3.find() would return
{
"_id" : ObjectId("588610ead0ae360cb815e55f"),
"key1" : "115384042",
"key2" : "276209",
"key3" : "10101122317876",
"key4" : 10,
"key5" : 4,
"key6" : 0,
"key7" : "false",
"key8" : 0,
"key9" : "false"
}
Edit: MongoDB 3.4 solution
If you don't want to specify all keys in the $project stage, you can take advantage of $addFields and $replaceRoot, two new operators introduced in MongoDB 3.4.
the query would become:
db.coll1.aggregate([
  { "$lookup": {
    "from": "coll2",
    "localField": "key1",
    "foreignField": "key1",
    "as": "coll2_doc"
  }},
  { "$unwind": "$coll2_doc" },
  { "$redact": {
    "$cond": [
      { "$eq": [ "$key2", "$coll2_doc.key2" ] },
      "$$KEEP",
      "$$PRUNE"
    ]
  }},
  { "$addFields": { "coll2_doc.key3": "$key3" } },
  { "$replaceRoot": { "newRoot": "$coll2_doc" } },
  { "$out": "coll3" }
], { allowDiskUse: true })
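Since the question is using the Ruby driver, the same pipeline can also be issued through the mongo gem rather than the shell. Here is a minimal sketch, reusing the client from the question's code; the :allow_disk_use option is assumed to be supported by your driver version:
pipeline = [
  { '$lookup'  => { 'from' => 'coll2', 'localField' => 'key1',
                    'foreignField' => 'key1', 'as' => 'coll2_doc' } },
  { '$unwind'  => '$coll2_doc' },
  { '$redact'  => { '$cond' => [ { '$eq' => [ '$key2', '$coll2_doc.key2' ] },
                                 '$$KEEP', '$$PRUNE' ] } },
  { '$addFields'   => { 'coll2_doc.key3' => '$key3' } },
  { '$replaceRoot' => { 'newRoot' => '$coll2_doc' } },
  { '$out' => 'coll3' }
]

# The aggregation view is lazy; iterating it (e.g. with to_a) sends the command.
# With $out the returned result set itself is empty - the documents land in coll3.
client[:coll1].aggregate(pipeline, :allow_disk_use => true).to_a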

After toying around with this for some time, I realized that indexes had not been added. Adding indexes reduces the query run time by orders of magnitude.
To add the indexes, do the following:
db.coll1.ensureIndex({"key1": 1, "key2": 1});
db.coll2.ensureIndex({"key1": 1, "key2": 1});
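Roughly the equivalent from the Ruby driver, as a sketch reusing the client from the question's code:
client[:coll1].indexes.create_one({ :key1 => 1, :key2 => 1 })
client[:coll2].indexes.create_one({ :key1 => 1, :key2 => 1 })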
With the indexes in place, the overall query run time came down to a tiny fraction of what it was earlier.
The lesson is that when working with large data sets, index the fields used in the find - that by itself reduces query run time a lot.
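Independent of the indexes, the write side of the original loop can also be batched, so matches are inserted with one insert_many call per chunk instead of one insert_one call per document. A rough sketch of that idea, reusing the client and cursor from the question's code (the batch size of 1000 is an arbitrary choice):
batch = []
cursor.each do |doc|
  doc_coll2 = client[:coll2].find(:key1 => doc[:key1], :key2 => doc[:key2]).limit(1).first
  next unless doc_coll2 && doc[:key3]

  doc_coll2[:key3] = doc[:key3]
  doc_coll2.delete(:_id)
  batch << doc_coll2

  if batch.size >= 1000               # flush in chunks to keep memory bounded
    client[:coll3].insert_many(batch)
    batch.clear
  end
end
client[:coll3].insert_many(batch) unless batch.empty? # flush the remainder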

Related

How to fetch a field from an array of objects in an Elasticsearch index as a CSV file to Google Cloud Storage using Logstash

I am using Elasticsearch to index data and want to export a few fields from the index created every day to Google Cloud Storage. How do I get fields from an array of objects in an Elasticsearch index and send them as a CSV file to a GCS bucket using Logstash?
I tried the below conf to fetch nested fields from the index:
input {
  elasticsearch {
    hosts => "host:443"
    user => "user"
    ssl => true
    connect_timeout_seconds => 600
    request_timeout_seconds => 600
    password => "pwd"
    ca_file => "ca.crt"
    index => "test"
    query => '
    {
      "_source": ["obj1.Name","obj1.addr","obj1.obj2.location", "Hierarchy.categoryUrl"],
      "query": {
        "match_all": {}
      }
    }
    '
  }
}
filter {
  mutate {
    rename => {
      "[obj1][Name]" => "col1"
      "[obj1][addr]" => "col2"
      "[obj1][obj2][location]" => "col3"
      "[Hierarchy][0][categoryUrl]" => "col4"
    }
  }
}
output {
  google_cloud_storage {
    codec => csv {
      include_headers => true
      columns => [ "col1", "col2", "col3" ]
    }
    bucket => "bucket"
    json_key_file => "creds.json"
    temp_directory => "/tmp"
    log_file_prefix => "log_gcs"
    max_file_size_kbytes => 1024
    date_pattern => "%Y-%m-%dT%H:00"
    flush_interval_secs => 600
    gzip => false
    uploader_interval_secs => 600
    include_uuid => true
    include_hostname => true
  }
}
How do I get the field populated into the above CSV from an array of objects? In the example below I want to fetch categoryUrl from the first object of the array, populate it into the CSV table, and send it to the GCS bucket.
I have tried the below approaches:
"_source": ["obj1.Name","obj1.addr","obj1.obj2.location", "Hierarchy.categoryUrl"]
and
"_source": ["obj1.Name","obj1.addr","obj1.obj2.location", "Hierarchy[0].categoryUrl"]
with
mutate {
  rename => {
    "[obj1][Name]" => "col1"
    "[obj1][addr]" => "col2"
    "[obj1][obj2][location]" => "col3"
    "[Hierarchy][0][categoryUrl]" => "col4"
  }
}
For an input sample:
"Hierarchy" : [
  {
    "level" : "1",
    "category" : "test",
    "categoryUrl" : "testurl1"
  },
  {
    "level" : "2",
    "category" : "test2",
    "categoryUrl" : "testurl2"
  }
]
Attaching a sample document where I am trying to fetch merchandisingHierarchy[0].categoryUrl and pricingInfo[0].basePrice:
{
"_index" : "amulya-test",
"_type" : "_doc",
"_id" : "ldZPJoYBFi8LOEDK_M2f",
"_score" : 1.0,
"_ignored" : [
"itemDetails.description.keyword"
],
"_source" : {
"itemDetails" : {
"compSku" : "202726",
"compName" : "abc.com",
"compWebsite" : "abc.com",
"title" : "Monteray 38.25 in. x 73.375 in. Frameless Hinged Corner Shower Enclosure in Brushed Nickel",
"description" : "Create the modthroom of your dreams with the clean lines of the VIGO Monteray Frameless Shower Enclosure. Solid 3/8 in. tempered glass combined with stainless steel and solid brass construction makes this enclosure strong and long-lasting. The sleek, reversible, outward-opening door features a convenient towel bar. This versatile enclosure can be installed on a tile floor or with a VIGO Shower Base. With a single water deflector along the bottom seal strip, water is redirected back into the shower to keep your bathroom dry, clean, and pristine.",
"modelNumber" : "VG6011BNCL40",
"upc" : "8137756684",
"hasVariations" : false,
"productDetailsBulletPoints" : [ ],
"itemUrls" : {
"productPageUrl" : "https://.abc.com/p/VIGO-Monteray-38-in-x-73-375-in-Frameless-Hinged-Corner-Shower-Enclosure-in-Brushed-Nickel-VG6011BNCL40/202722616",
"primaryImageUrl" : "https://images.thdstatic.com/productImages/d77d9e8b-1ea1-4811-a470-8364c8e47402/svn/vigo-shower-enclosures-vg6011bncl40-64_600.jpg",
"secondaryImageUrls" : [
"https://images.thdstatic.com/productImages/d77d9e8b-1e1-4811-a470-8364c8e47402/svn/vigo-shower-enclosures-vg6011bncl40-64_1000.jpg",
"https://images.thdstatic.com/productImages/db539ff9-6df-48c2-897a-18dd1e1794e3/svn/vigo-shower-enclosures-vg6011bncl40-e1_1000.jpg",
"https://images.thdstatic.com/productImages/47c5090b-49a-46bc-a36d-921ddae5e1ab/svn/vigo-shower-enclosures-vg6011bncl40-40_1000.jpg",
"https://images.thdstatic.com/productImages/add6691c-a02-466d-9a1a-47200b05685e/svn/vigo-shower-enclosures-vg6011bncl40-a0_1000.jpg",
"https://images.thdstatic.com/productImages/d638230e-0d9-40c9-be93-7f7bf24f0732/svn/vigo-shower-enclosures-vg6011bncl40-1d_1000.jpg"
]
}
},
"merchandisingHierarchy" : [
{
"level" : "1",
"category" : "Home",
"categoryUrl" : "host/"
},
{
"level" : "2",
"category" : "Bath",
"categoryUrl" : "host/b/Bath/N-5yc1vZbzb3"
},
{
"level" : "3",
"category" : "Showers",
"categoryUrl" : "host/b/Bath-Showers/N-5yc1vZbzcd"
},
{
"level" : "4",
"category" : "Shower Doors",
"categoryUrl" : "host/b/Bath-Showers-Shower-Doors/N-5yc1vZbzcg"
},
{
"level" : "5",
"category" : "Shower Enclosures",
"categoryUrl" : "host/b/Bath-Showers-Shower-Doors-Shower-Enclosures/N-5yc1vZcbn2"
}
],
"reviewsAndRatings" : {
"pdtReviewCount" : 105
},
"additionalAttributes" : {
"isAddon" : false
},
"productSpecifications" : {
"Warranties" : { },
"Details" : { },
"Dimensions" : { }
},
"promoDetails" : [
{
"promoName" : "Save $150.00 (15%)",
"promoPrice" : 849.9
}
],
"locationDetails" : { },
"storePickupDetails" : {
"deliveryText" : "Get it by Mon, Feb 20",
"toEddDate" : "Mon, Feb 20",
"isBackordered" : false,
"selectedEddZipcode" : "20147",
"shipToStoreEnabled" : true,
"homeDeliveryEnabled" : true,
"scheduledDeliveryEnabled" : false
},
"recommendedProducts" : [ ],
"pricingInfo" : [
{
"type" : "SAS",
"offerPrice" : 849.9,
"sellerName" : "abc.com",
"onClearance" : false,
"inStock" : true,
"isBuyBoxWinner" : true,
"promo" : [
{
"onPromo" : true,
"promoName" : "Save $150.00 (15%)",
"promoPrice" : 849.9
}
],
"basePrice" : 999.9,
"priceVariants" : [
{
"basePrice" : 999.9,
"offerPrice" : 849.9
}
],
"inventoryDetails" : {
"stockInStore" : false,
"stockOnline" : true
}
}
]
}
}
You can do it like this:
input {
  elasticsearch {
    ...
    query => '
    {
      "_source": ["merchandisingHierarchy.categoryUrl"],
      "query": {
        "match_all": {}
      }
    }
    '
  }
}
filter {
  mutate {
    add_field => {
      "col1" => "%{[merchandisingHierarchy][0][categoryUrl]}"
      "col2" => "%{[pricingInfo][0][basePrice]}"
    }
  }
}
output {
  stdout {
    codec => csv {
      include_headers => true
      columns => [ "col1", "col2" ]
    }
  }
}
I've tested with your sample document and I get the output below, which looks like it is working per your expectation:
col1,col2
host/,999.9

Add a JSON attribute to flow file content with a Jolt Transformation or an alternative way in NiFi

I need to add this attribute named 'metadata' to the JSON flow file content.
The attribute 'metadata' looks like:
{"startTime":1451952013663, "endTime":1453680013663, "name":"Npos19", "deleted":false}
The input is like this:
{
  "id": 154299613718447,
  "values": [
    {
      "timestamp": 1451977869683,
      "value": 13.1
    },
    {
      "timestamp": 1453949805784,
      "value": 7.54
    }
  ]
}
My goal is like:
{
  "id": 154299613718447,
  "values": [
    {
      "startTime": 1451952013663,
      "endTime": 1453680013663,
      "name": "Npos19",
      "deleted": false,
      "timestamp": 1451977869683,
      "value": 13.1
    },
    {
      "startTime": 1451952013663,
      "endTime": 1453680013663,
      "name": "Npos19",
      "deleted": false,
      "timestamp": 1453949805784,
      "value": 7.54
    }
  ]
}
I tried to use the Jolt Transformation:
{
  "operation": "default",
  "spec": {
    // extract metadata array from json attribute and put it in a temporary array
    "tempArray": "${metadata:jsonPath('$.*')}"
  }
}
but it does not work. I need to extract the metadata array with $.* because I do not know what keys will be present.
Is there an alternative, fast way with other NiFi processors to merge the attribute into the flow content?
Thanks in advance.
It's possible with a combination of two processors: EvaluateJsonPath -> ScriptedTransformRecord.
EvaluateJsonPath
Destination: flowfile-attribute
Return Type: json
values (dynamic property): $.values
ScriptedTransformRecord
Record Reader: JsonTreeReader
Record Writer: JsonRecordSetWriter
Script Language: Groovy
Script Body:
def mapMetadata = new groovy.json.JsonSlurper().parseText(attributes['metadata'])
def mapValue = new groovy.json.JsonSlurper().parseText(attributes['values'])
def values = mapValue.each { value ->
    mapMetadata.each { k, v ->
        value."${k}" = v
    }
}
record.setValue('values', null)
record.setValue('updateValues', values)
record
Output json
[ {
"id" : 154299613718447,
"values" : null,
"updateValues" : [ {
"timestamp" : 1451977869683,
"value" : 13.1,
"startTime" : 1451952013663,
"endTime" : 1453680013663,
"name" : "Npos19",
"deleted" : false
}, {
"timestamp" : 1453949805784,
"value" : 7.54,
"startTime" : 1451952013663,
"endTime" : 1453680013663,
"name" : "Npos19",
"deleted" : false
} ]
} ]

MongoDB scans entire index when using $all and $elemMatch

I have a collection of user documents, where each user can have an arbitrary set of properties. Each user is associated to an app document. Here is an example user:
{
"appId": "XXXXXXX",
"properties": [
{ "name": "age", "value": 30 },
{ "name": "gender", "value": "female" },
{ "name": "alive", "value": true }
]
}
I would like to be able to find/count users based on the values of their properties. For example, find me all users for app X that have property Y > 10 and Z equals true.
I have a compound, multikey index on this collection db.users.ensureIndex({ "appId": 1, "properties.name": 1, "properties.value": 1}). This index is working well for single condition queries, ex:
db.users.find({
  appId: 'XXXXXX',
  properties: {
    $elemMatch: {
      name: 'age',
      value: {
        $gt: 10
      }
    }
  }
})
The above query completes in < 300ms with a collection of 1M users. However, when I try and add a second condition, the performance degrades considerably (7-8s), and the explain() output indicates that the whole index is being scanned to fulfill the query ("nscanned" : 2752228).
Query
db.users.find({
  appId: 'XXXXXX',
  properties: {
    $all: [
      {
        $elemMatch: {
          name: 'age',
          value: {
            $gt: 10
          }
        }
      },
      {
        $elemMatch: {
          name: 'alive',
          value: true
        }
      }
    ]
  }
})
Explain
{
"cursor" : "BtreeCursor appId_1_properties.name_1_properties.value_1",
"isMultiKey" : true,
"n" : 256,
"nscannedObjects" : 1000000,
"nscanned" : 2752228,
"nscannedObjectsAllPlans" : 1018802,
"nscannedAllPlans" : 2771030,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 21648,
"nChunkSkips" : 0,
"millis" : 7425,
"indexBounds" : {
"appId" : [
[
"XXXXX",
"XXXXX"
]
],
"properties.name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"properties.value" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"filterSet" : false
}
I assume this is because Mongo is unable to create suitable bounds since I am looking for both boolean and integer values.
My question is this: Is there a better way to structure my data, or modify my query to improve performance and take better advantage of my index? Is it possible to instruct mongo to treat each condition separately, generate appropriate bounds, and then perform the intersection of the results, instead of scanning all documents? Or is mongo just not suited for this type of use case?
I know this is an old question, but I think it would be much better to structure your data without the "name" and "value" tags:
{
  "appId": "XXXXXXX",
  "properties": [
    { "age": 30 },
    { "gender": "female" },
    { "alive": true }
  ]
}
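For illustration, here is a rough sketch of what the two-condition query (and matching indexes) could look like against that restructured schema, written with the Ruby driver; the database name, collection name and index choices are assumptions, not part of the original question:
require 'mongo'

client = Mongo::Client.new([ '127.0.0.1:27017' ], :database => 'some_db')
users  = client[:users]

# hypothetical compound multikey indexes on the flattened property keys
users.indexes.create_one({ 'appId' => 1, 'properties.age' => 1 })
users.indexes.create_one({ 'appId' => 1, 'properties.alive' => 1 })

# "app X, age > 10 and alive == true"
users.find(
  'appId' => 'XXXXXX',
  'properties.age'   => { '$gt' => 10 },
  'properties.alive' => true
).count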

Limit aggregation in grouped aggregation

I had a collection like this, but with much more data.
{
_id: ObjectId("db759d014f70743495ef1000"),
tracked_item_origin: "winword",
tracked_item_type: "Software",
machine_user: "mmm.mmm",
organization_id: ObjectId("a91864df4f7074b33b020000"),
group_id: ObjectId("20ea74df4f7074b33b520000"),
tracked_item_id: ObjectId("1a050df94f70748419140000"),
tracked_item_name: "Word",
duration: 9540,
}
{
_id: ObjectId("2b769d014f70743495fa1000"),
tracked_item_origin: "http://www.facebook.com",
tracked_item_type: "Site",
machine_user: "gabriel.mello",
organization_id: ObjectId("a91864df4f7074b33b020000"),
group_id: ObjectId("3f6a64df4f7074b33b040000"),
tracked_item_id: ObjectId("6f3466df4f7074b33b080000"),
tracked_item_name: "Facebook",
duration: 7920,
}
I do an aggregation that returns grouped data like this:
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"Twitter"}, "duration"=>288540},
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"ANoticia"}, "duration"=>237300},
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"Facebook"}, "duration"=>203460},
{"_id"=>{"tracked_item_type"=>"Software", "tracked_item_name"=>"Word"}, "duration"=>269760},
{"_id"=>{"tracked_item_type"=>"Software", "tracked_item_name"=>"Excel"}, "duration"=>204240}
Simple aggregation code:
AgentCollector.collection.aggregate(
  {'$match' => {group_id: '20ea74df4f7074b33b520000'}},
  {'$group' => {
    _id: {tracked_item_type: '$tracked_item_type', tracked_item_name: '$tracked_item_name'},
    duration: {'$sum' => '$duration'}
  }},
  {'$sort' => {
    '_id.tracked_item_type' => 1,
    duration: -1
  }}
)
Is there a way to limit it to only 2 items per tracked_item_type key? E.g. 2 Sites and 2 Software entries.
As your question currently stands it is a bit unclear, so I really hope you mean that you want to specify two Site keys and two Software keys, because that's a nice and simple answer that you can just add to your $match phase, as in:
{$match: {
group_id: "20ea74df4f7074b33b520000",
tracked_item_name: {$in: ['Twitter', 'Facebook', 'Word', 'Excel' ] }
}},
And we can all cheer and be happy ;)
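In the question's Ruby code that is just one extra condition in the existing '$match' stage; a sketch reusing the pipeline from the question:
AgentCollector.collection.aggregate(
  {'$match' => {
    group_id: '20ea74df4f7074b33b520000',
    tracked_item_name: {'$in' => ['Twitter', 'Facebook', 'Word', 'Excel']}
  }},
  {'$group' => {
    _id: {tracked_item_type: '$tracked_item_type', tracked_item_name: '$tracked_item_name'},
    duration: {'$sum' => '$duration'}
  }},
  {'$sort' => {
    '_id.tracked_item_type' => 1,
    duration: -1
  }}
)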
If however your question is something more diabolical such as, getting the top 2 Sites and Software entries from the result by duration, then we thank you very much for spawning this abomination.
Warning:
Your mileage may vary on what you actually want to do or whether this is going to blow up by the sheer size of your results. But this follows as an example of what you are in for:
db.collection.aggregate([
  // Match items first to reduce the set
  {$match: {group_id: "20ea74df4f7074b33b520000" }},
  // Group on the types and "sum" of duration
  {$group: {
    _id: {
      tracked_item_type: "$tracked_item_type",
      tracked_item_name: "$tracked_item_name"
    },
    duration: {$sum: "$duration"}
  }},
  // Sort by type and duration descending
  {$sort: { "_id.tracked_item_type": 1, duration: -1 }},
  /* The fun part */
  // Re-shape results to "sites" and "software" arrays
  {$group: {
    _id: null,
    sites: {$push:
      {$cond: [
        {$eq: ["$_id.tracked_item_type", "Site" ]},
        { _id: "$_id", duration: "$duration" },
        null
      ]}
    },
    software: {$push:
      {$cond: [
        {$eq: ["$_id.tracked_item_type", "Software" ]},
        { _id: "$_id", duration: "$duration" },
        null
      ]}
    }
  }},
  // Remove the null values for "software"
  {$unwind: "$software"},
  {$match: { software: {$ne: null} }},
  {$group: {
    _id: "$_id",
    software: {$push: "$software"},
    sites: {$first: "$sites"}
  }},
  // Remove the null values for "sites"
  {$unwind: "$sites"},
  {$match: { sites: {$ne: null} }},
  {$group: {
    _id: "$_id",
    software: {$first: "$software"},
    sites: {$push: "$sites"}
  }},
  // Project out software and limit to the *top* 2 results
  {$unwind: "$software"},
  {$project: {
    _id: { _id: "$software._id", duration: "$software.duration" },
    sites: "$sites"
  }},
  {$limit: 2},
  // Project sites, grouping multiple software per key, requires a sort
  // then limit the *top* 2 results
  {$unwind: "$sites"},
  {$group: {
    _id: { _id: "$sites._id", duration: "$sites.duration" },
    software: {$push: "$_id" }
  }},
  {$sort: { "_id.duration": -1 }},
  {$limit: 2}
])
Now what that results in is not exactly the clean set of results that would be ideal, but it is something that can be programmatically worked with, and it is better than filtering the previous results in a loop. (My data from testing)
{
"result" : [
{
"_id" : {
"_id" : {
"tracked_item_type" : "Site",
"tracked_item_name" : "Digital Blasphemy"
},
"duration" : 8000
},
"software" : [
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Word"
},
"duration" : 9540
},
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Notepad"
},
"duration" : 4000
}
]
},
{
"_id" : {
"_id" : {
"tracked_item_type" : "Site",
"tracked_item_name" : "Facebook"
},
"duration" : 7920
},
"software" : [
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Word"
},
"duration" : 9540
},
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Notepad"
},
"duration" : 4000
}
]
}
],
"ok" : 1
}
So you see you get the top 2 Sites in the array, with the top 2 Software items embedded in each. Aggregation itself cannot clear this up any further, because we would need to re-merge the items we split apart in order to do this, and as yet there is no operator we could use to perform that action.
But that was fun. It's not all the way done, but most of the way, and making that into a 4 document response would be relatively trivial code. But my head hurts already.
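For what it's worth, turning that two-document response into four flat documents does not need any more aggregation; here is a minimal Ruby sketch of the post-processing, assuming a results array shaped like the output shown above:
# Each result document holds one top site in its outer _id and the same top-2 software entries in "software".
results = [
  { '_id' => { '_id' => { 'tracked_item_type' => 'Site', 'tracked_item_name' => 'Digital Blasphemy' },
               'duration' => 8000 },
    'software' => [
      { '_id' => { 'tracked_item_type' => 'Software', 'tracked_item_name' => 'Word' },    'duration' => 9540 },
      { '_id' => { 'tracked_item_type' => 'Software', 'tracked_item_name' => 'Notepad' }, 'duration' => 4000 }
    ] },
  # ... second result document, same shape ...
]

top_sites    = results.map { |doc| doc['_id'] }   # the top-2 sites live in the outer _id of each document
top_software = results.first['software']          # the same top-2 software entries are embedded in each document

four_docs = (top_sites + top_software).map do |entry|
  entry['_id'].merge('duration' => entry['duration'])
end
# => four flat hashes, each with tracked_item_type, tracked_item_name and duration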

MongoDB Java - how to delete a particular document

{
  "CONTENT1": {
    "YDXM": [
      { "name": "1", "MBNH": "1" },
      { "name": "2", "MBNH": "2" }
    ]
  }
}
I want to delete the {"name":"1","MBNH":"1"}. How can I achieve this?
Assuming that the following is your document and you want to delete the ENTIRE document:
{
  "CONTENT1": {
    "YDXM": [
      {
        "name": "1",
        "MBNH": "1"
      },
      {
        "name": "2",
        "MBNH": "2"
      }
    ]
  }
}
You could use this:
db.test.remove({"CONTENT1.YDXM.name" : "1", "CONTENT1.YDXM.MBNH" : "1"})
Now, if you want to extract the document {"name" : "1", "MBNH" : "1"} from the CONTENT1.YDXM array, you should use the $pull operator:
db.test.update({"CONTENT1.YDXM.name" : "1", "CONTENT1.YDXM.MBNH" : "1"}, { $pull : { "CONTENT1.YDXM" : {"name" : "1", "MBNH" : "1"} } }, false, true)
This will perform an update on all documents that match the first argument. The second argument, with the $pull operator, means that MongoDB will remove the value {"name" : "1", "MBNH" : "1"} from the CONTENT1.YDXM array.
You can read more about the $pull operator and the update command at these links:
http://docs.mongodb.org/manual/reference/operator/pull/
http://docs.mongodb.org/manual/applications/update/
