I need to add an attribute named 'metadata' to the JSON flow content.
The attribute 'metadata' is like:
{"startTime":1451952013663, "endTime":1453680013663, "name":"Npos19", "deleted":false}
The input is like this:
{
"id": 154299613718447,
"values": [
{
"timestamp": 1451977869683,
"value": 13.1
},
{
"timestamp": 1453949805784,
"value": 7.54
}
]
}
My goal is this:
{
"id": 154299613718447,
"values": [ {
"startTime":1451952013663,
"endTime":1453680013663,
"name":"Npos19",
"deleted":false,
"timestamp": 1451977869683,
"value": 13.1
},
{
"startTime":1451952013663,
"endTime":1453680013663,
"name":"Npos19",
"deleted":false,
"timestamp": 1453949805784,
"value": 7.54
}
]
}
I tried to use the Jolt Transformation:
{
"operation": "default",
"spec": {
// extract metadata array from json attribute and put it in a temporary array
"tempArray": "${metadata:jsonPath('$.*')}"
}
}
but it does not work. I need to extract the metadata array with $.* because I do not know which keys will be present.
Is there an alternative, fast way with other NiFi processors to merge the attribute into the flow content?
thanks in advance
It's possible with a combination of two processors: EvaluateJsonPath -> ScriptedTransformRecord.
EvaluateJsonPath
Destination: flowfile-attribute
Return Type: json
values (dynamic property): $.values
ScriptedTransformRecord
Record Reader: JsonTreeReader
Record Writer: JsonRecordSetWriter
Script Language: Groovy
Script Body:
// parse the 'metadata' attribute and the 'values' attribute (extracted by EvaluateJsonPath)
def mapMetadata = new groovy.json.JsonSlurper().parseText(attributes['metadata'])
def mapValue = new groovy.json.JsonSlurper().parseText(attributes['values'])
// copy every metadata key/value pair into each element of the values array
def values = mapValue.each { value ->
    mapMetadata.each { k, v ->
        value."${k}" = v
    }
}
// clear the original 'values' field and write the merged array to 'updateValues'
record.setValue('values', null)
record.setValue('updateValues', values)
record
Output JSON:
[ {
"id" : 154299613718447,
"values" : null,
"updateValues" : [ {
"timestamp" : 1451977869683,
"value" : 13.1,
"startTime" : 1451952013663,
"endTime" : 1453680013663,
"name" : "Npos19",
"deleted" : false
}, {
"timestamp" : 1453949805784,
"value" : 7.54,
"startTime" : 1451952013663,
"endTime" : 1453680013663,
"name" : "Npos19",
"deleted" : false
} ]
} ]
I'm using MongoDB with Ruby, via the mongo gem.
I have the following scenario:
for each document in a collection, say coll1, look at key1 and key2
search for a document in another collection, say coll2, with matching values for key1 and key2
if there is a match, add to the document fetched in #2 a new key key3, whose value is set to the value of key3 in the document referenced in #1
insert the updated hash into a new collection coll3
The general guideline with MongoDB has been to handle cross collection operations in application code.
So I do the following:
client = Mongo::Client.new([ '127.0.0.1:27017' ], :database => some_db,
:server_selection_timeout => 5)
cursor = client[:coll1].find({}, { :projection => {:_id => 0} }) # exclude _id
cursor.each do |doc|
doc_coll2 = client[:coll2].find('$and' => [{:key1 => doc[:key1]}, {:key2 => doc[:key2] }]).limit(1).first # no find_one method
if(doc_coll2 && doc[:key3])
doc_coll2[:key3] = doc[:key3]
doc_coll2.delete(:_id) # remove key :_id
client[:coll3].insert_one(doc_coll2)
end
end
This works, but it takes a long time to finish: approximately 250 ms per document in coll1, or about 3600 s (1 hour) for ~15000 records. That seems like a lot, and is probably down to reading documents one at a time, doing the check in application code, and then writing documents back to the new collection one at a time.
Is there a way to get this operation done faster? Is the way I'm doing it even the right approach?
Example documents
coll1
{
"_id" : ObjectId("588610ead0ae360cb815e55f"),
"key1" : "115384042",
"key2" : "276209",
"key3" : "10101122317876"
}
coll2
{
"_id" : ObjectId("788610ead0ae360def15e88e"),
"key1" : "115384042",
"key2" : "276209",
"key4" : 10,
"key5" : 4,
"key6" : 0,
"key7" : "false",
"key8" : 0,
"key9" : "false"
}
coll3
{
"_id" : ObjectId("788610ead0ae360def15e88e"),
"key1" : "115384042",
"key2" : "276209",
"key3" : "10101122317876",
"key4" : 10,
"key5" : 4,
"key6" : 0,
"key7" : "false",
"key8" : 0,
"key9" : "false"
}
A solution would be to use aggregation instead, and do this in one single query:
perform a join on key1 field with $lookup
unwind the array with $unwind
keep doc where coll1.key2 == coll2.key2 with $redact
reformat the document with $project
write it to coll3 with $out
so the query would be:
db.coll1.aggregate([
{ "$lookup": {
"from": "coll2",
"localField": "key1",
"foreignField": "key1",
"as": "coll2_doc"
}},
{ "$unwind": "$coll2_doc" },
{ "$redact": {
"$cond": [
{ "$eq": [ "$key2", "$coll2_doc.key2" ] },
"$$KEEP",
"$$PRUNE"
]
}},
{
$project: {
key1: 1,
key2: 1,
key3: 1,
key4: "$coll2_doc.key4",
key5: "$coll2_doc.key5",
key6: "$coll2_doc.key6",
key7: "$coll2_doc.key7",
key8: "$coll2_doc.key8",
key9: "$coll2_doc.key9",
}
},
{$out: "coll3"}
], {allowDiskUse: true} );
and db.coll3.find() would return
{
"_id" : ObjectId("588610ead0ae360cb815e55f"),
"key1" : "115384042",
"key2" : "276209",
"key3" : "10101122317876",
"key4" : 10,
"key5" : 4,
"key6" : 0,
"key7" : "false",
"key8" : 0,
"key9" : "false"
}
Edit: MongoDB 3.4 solution
If you don't want to specify all keys in the $project stage, you can take advantage of $addFields and $replaceRoot, two new operators introduced in MongoDB 3.4
the query would become:
db.coll1.aggregate([
{ "$lookup": {
"from": "coll2",
"localField": "key1",
"foreignField": "key1",
"as": "coll2_doc"
}},
{ "$unwind": "$coll2_doc" },
{ "$redact": {
"$cond": [
{ "$eq": [ "$key2", "$coll2_doc.key2" ] },
"$$KEEP",
"$$PRUNE"
]
}},
{$addFields: {"coll2_doc.key3": "$key3" }},
{$replaceRoot: {newRoot: "$coll2_doc"}},
{$out: "coll3"}
], {allowDiskUse: true})
After toying around with this for some time, I realized that indexes had not been added. Adding indexes reduces the query run time by orders of magnitude.
To add the indexes, do the following:
db.coll1.ensureIndex({"key1": 1, "key2": 1});
db.coll2.ensureIndex({"key1": 1, "key2": 1});
With the indexes in place, the overall query run time came down to a tiny fraction of what it was earlier.
The takeaway is that when working with large data sets, index the fields used in your queries; that alone reduces query run time a lot.
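As a quick sanity check (a minimal sketch for the mongo shell, reusing the example keys from coll1 above), you can confirm that the index is actually used by asking one of the per-document lookups for its query plan; note that newer shells spell the index command createIndex rather than ensureIndex:
// With the compound index in place, the winning plan should show an IXSCAN
// on { key1: 1, key2: 1 } instead of a full COLLSCAN.
db.coll2.find({ "key1": "115384042", "key2": "276209" }).explain("executionStats")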
I had a collection like this, but with much more data.
{
_id: ObjectId("db759d014f70743495ef1000"),
tracked_item_origin: "winword",
tracked_item_type: "Software",
machine_user: "mmm.mmm",
organization_id: ObjectId("a91864df4f7074b33b020000"),
group_id: ObjectId("20ea74df4f7074b33b520000"),
tracked_item_id: ObjectId("1a050df94f70748419140000"),
tracked_item_name: "Word",
duration: 9540,
}
{
_id: ObjectId("2b769d014f70743495fa1000"),
tracked_item_origin: "http://www.facebook.com",
tracked_item_type: "Site",
machine_user: "gabriel.mello",
organization_id: ObjectId("a91864df4f7074b33b020000"),
group_id: ObjectId("3f6a64df4f7074b33b040000"),
tracked_item_id: ObjectId("6f3466df4f7074b33b080000"),
tracked_item_name: "Facebook",
duration: 7920,
}
I do an aggregation that returns grouped data like this:
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"Twitter"}, "duration"=>288540},
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"ANoticia"}, "duration"=>237300},
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"Facebook"}, "duration"=>203460},
{"_id"=>{"tracked_item_type"=>"Software", "tracked_item_name"=>"Word"}, "duration"=>269760},
{"_id"=>{"tracked_item_type"=>"Software", "tracked_item_name"=>"Excel"}, "duration"=>204240}
Simple aggregation code:
AgentCollector.collection.aggregate(
{'$match' => {group_id: '20ea74df4f7074b33b520000'}},
{'$group' => {
_id: {tracked_item_type: '$tracked_item_type', tracked_item_name: '$tracked_item_name'},
duration: {'$sum' => '$duration'}
}},
{'$sort' => {
'_id.tracked_item_type' => 1,
duration: -1
}}
)
Is there a way to limit the result to only 2 items per tracked_item_type key? E.g. 2 Sites and 2 Software items.
As your question currently stands, it is unclear. I really hope you mean that you want to specify two Site keys and two Software keys, because that is a nice and simple answer that you can just add to your $match phase, as in:
{$match: {
group_id: "20ea74df4f7074b33b520000",
tracked_item_name: {$in: ['Twitter', 'Facebook', 'Word', 'Excel' ] }
}},
And we can all cheer and be happy ;)
If, however, your question is something more diabolical, such as getting the top 2 Site and Software entries from the result by duration, then we thank you very much for spawning this abomination.
Warning:
Your mileage may vary depending on what you actually want to do, or on whether this is going to blow up from the sheer size of your results. But what follows is an example of what you are in for:
db.collection.aggregate([
// Match items first to reduce the set
{$match: {group_id: "20ea74df4f7074b33b520000" }},
// Group on the types and "sum" of duration
{$group: {
_id: {
tracked_item_type: "$tracked_item_type",
tracked_item_name: "$tracked_item_name"
},
duration: {$sum: "$duration"}
}},
// Sort by type and duration descending
{$sort: { "_id.tracked_item_type": 1, duration: -1 }},
/* The fun part */
// Re-shape results to "sites" and "software" arrays
{$group: {
_id: null,
sites: {$push:
{$cond: [
{$eq: ["$_id.tracked_item_type", "Site" ]},
{ _id: "$_id", duration: "$duration" },
null
]}
},
software: {$push:
{$cond: [
{$eq: ["$_id.tracked_item_type", "Software" ]},
{ _id: "$_id", duration: "$duration" },
null
]}
}
}},
// Remove the null values for "software"
{$unwind: "$software"},
{$match: { software: {$ne: null} }},
{$group: {
_id: "$_id",
software: {$push: "$software"},
sites: {$first: "$sites"}
}},
// Remove the null values for "sites"
{$unwind: "$sites"},
{$match: { sites: {$ne: null} }},
{$group: {
_id: "$_id",
software: {$first: "$software"},
sites: {$push: "$sites"}
}},
// Project out software and limit to the *top* 2 results
{$unwind: "$software"},
{$project: {
    _id: { _id: "$software._id", duration: "$software.duration" },
    sites: "$sites"
}},
{$limit : 2},
// Project sites, grouping multiple software per key, requires a sort
// then limit the *top* 2 results
{$unwind: "$sites"},
{$group: {
_id: { _id: "$sites._id", duration: "$sites.duration" },
software: {$push: "$_id" }
}},
{$sort: { "_id.duration": -1 }},
{$limit: 2}
])
Now, what that results in is not exactly the clean set of results that would be ideal, but it is something that can be programmatically worked with, and it is better than filtering the previous results in a loop. (This is from my test data.)
{
"result" : [
{
"_id" : {
"_id" : {
"tracked_item_type" : "Site",
"tracked_item_name" : "Digital Blasphemy"
},
"duration" : 8000
},
"software" : [
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Word"
},
"duration" : 9540
},
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Notepad"
},
"duration" : 4000
}
]
},
{
"_id" : {
"_id" : {
"tracked_item_type" : "Site",
"tracked_item_name" : "Facebook"
},
"duration" : 7920
},
"software" : [
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Word"
},
"duration" : 9540
},
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Notepad"
},
"duration" : 4000
}
]
}
],
"ok" : 1
}
So you see you get the top 2 Sites in the array, with the top 2 Software items embedded in each. Aggregation itself cannot clean this up any further, because we would need to re-merge the items we split apart in order to do so, and as yet there is no operator we could use to perform that action.
But that was fun. It's not all the way done, but most of the way, and making that into a 4 document response would be relatively trivial code. But my head hurts already.
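For illustration, here is a rough sketch of that client-side reshaping in mongo shell JavaScript. It assumes the aggregation output shown above is held in a variable named res; the variable name and the de-duplication by tracked_item_name are assumptions, not part of the pipeline.
// Flatten the nested result into plain { _id, duration } documents,
// de-duplicating the software entries that are repeated under each site.
var byName = {};
res.result.forEach(function (site) {
    byName[site._id._id.tracked_item_name] = { _id: site._id._id, duration: site._id.duration };
    site.software.forEach(function (sw) {
        byName[sw._id.tracked_item_name] = { _id: sw._id, duration: sw.duration };
    });
});
var docs = Object.keys(byName).map(function (name) { return byName[name]; });
// docs now holds the 4 documents: the top 2 Sites plus the top 2 Software items.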
I have a collection of documents which belong to a few authors:
[
{ id: 1, author_id: 'mark', content: [...] },
{ id: 2, author_id: 'pierre', content: [...] },
{ id: 3, author_id: 'pierre', content: [...] },
{ id: 4, author_id: 'mark', content: [...] },
{ id: 5, author_id: 'william', content: [...] },
...
]
I'd like to retrieve and paginate a distinct selection of the best-matching documents, keyed on the author's id:
[
{ id: 1, author_id: 'mark', content: [...], _score: 100 },
{ id: 3, author_id: 'pierre', content: [...], _score: 90 },
{ id: 5, author_id: 'william', content: [...], _score: 80 },
...
]
Here's what I'm currently doing (pseudo-code):
unique_docs = res.results.to_a.uniq{ |doc| doc.author_id }
The problem is precisely with pagination: how do I select 20 "distinct" documents?
Some people point to terms facets, but I'm not actually building a tag cloud:
Distinct selection with CouchDB and elasticsearch
http://elasticsearch-users.115913.n3.nabble.com/Getting-Distinct-Values-td3830953.html
Thanks,
Adit
As ElasticSearch does not at present provide a group_by equivalent, here's my attempt to do it manually.
While the ES community is working on a direct solution to this problem (probably a plugin), here's a basic attempt which works for my needs.
Assumptions.
I'm looking for relevant content
I've assumed that the first 300 docs are relevant, so I restrict my search to this selection, regardless of how many of them come from the same few authors.
for my needs I didn't "really" need full pagination; a "show more" button updated through Ajax was enough.
Drawbacks
results are not precise
since we take 300 docs at a time, we don't know how many unique docs will come out (they could possibly all be from the same author!). You should check whether this fits your average number of docs per author and probably consider a limit.
you need to do 2 queries (paying the remote call cost twice):
the first query asks for the 300 relevant docs with just these fields: id & author_id
the second query retrieves the full docs for the paginated ids
Here's some ruby pseudo-code: https://gist.github.com/saxxi/6495116
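For reference, here is a rough sketch of the two request bodies in the Elasticsearch query DSL, assuming a 1.x cluster where source filtering and the ids query are available; the search text and the example ids are placeholders.
The first request asks for the 300 most relevant hits but returns only the id and author_id fields:
{
  "size": 300,
  "_source": ["id", "author_id"],
  "query": { "query_string": { "query": "your search terms" } }
}
In application code you then keep the first hit per author_id (as with the uniq call above) and slice out the ids for the current page. The second request fetches the full documents for those ids:
{
  "query": { "ids": { "values": [1, 3, 5] } }
}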
Now that the 'group_by' issue has been addressed, you can use this feature as of Elasticsearch 1.3.0 (#6124).
If you search with the following query,
{
"aggs": {
"user_count": {
"terms": {
"field": "author_id",
"size": 0
}
}
}
}
you will get this result:
{
"took" : 123,
"timed_out" : false,
"_shards" : { ... },
"hits" : { ... },
"aggregations" : {
"user_count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "mark",
"doc_count" : 87350
}, {
"key" : "pierre",
"doc_count" : 41809
}, {
"key" : "william",
"doc_count" : 24476
} ]
}
}
}
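A terms aggregation on its own only returns per-author counts. If what you are after is the single best-matching document per author, one option is to nest a top_hits sub-aggregation (also introduced in 1.3.0) under the terms aggregation; a sketch, assuming author_id is indexed as a single token and that 20 authors per page is enough:
{
  "aggs": {
    "authors": {
      "terms": { "field": "author_id", "size": 20 },
      "aggs": {
        "best_doc": {
          "top_hits": { "size": 1 }
        }
      }
    }
  }
}
Each bucket then carries its best-scoring hit under best_doc.hits.hits[0]; paging across buckets still has to be handled in application code.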
{
  "CONTENT1": {
    "YDXM": [
      { "name": "1", "MBNH": "1" },
      { "name": "2", "MBNH": "2" }
    ]
  }
}
I want to delete the {"name":"1","MBNH":"1"}. How can I achieve this?
Assuming that the following is your document and you want to delete the ENTIRE document:
{
"CONTENT1": {
"YDXM": [
{
"name": "1",
"MBNH": "1"
},
{
"name": "2",
"MBNH": "2"
}
]
}
}
You could use this:
db.test.remove({"CONTENT1.YDXM.name" : "1", "CONTENT1.YDXM.MBNH" : "1"})
Now, if you want to remove the subdocument {"name" : "1", "MBNH" : "1"} from the CONTENT1.YDXM array, you should use the $pull operator:
db.test.update({"CONTENT1.YDXM.name" : "1", "CONTENT1.YDXM.MBNH" : "1"}, { $pull : { "CONTENT1.YDXM" : {"name" : "1", "MBNH" : "1"} } }, false, true)
This will perform an update on all documents that match the first argument. The second argument, with the $pull operator, means that MongoDB will remove the value {"name" : "1", "MBNH" : "1"} from the CONTENT1.YDXM array.
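For reference, assuming the example document above, the array should contain only the remaining element after the update:
{
  "CONTENT1": {
    "YDXM": [
      { "name": "2", "MBNH": "2" }
    ]
  }
}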
You can read more about the $pull operator and the update command at these links:
http://docs.mongodb.org/manual/reference/operator/pull/
http://docs.mongodb.org/manual/applications/update/