How do I find and remove duplicate Mongo documents with Ruby

I have a collection in Mongo with duplicates on a specific key that I need to remove all but one of. The Map Reduce solutions don't seem to make it clear how to keep just one of the duplicates. I am using Ruby; how can I do this in a somewhat efficient way? My current solution is unbelievably slow!
I currently just iterate over an array of the duplicate keys and delete the first document that is returned, but this only works if there is at most one duplicate document per key, and it is really slow.
dupes.each do |key|
  $mongodb.collection("some_collection").remove($mongodb.collection("some_collection").find({key: key}).first)
end

I think you should use MongoDB's ensureIndex() to remove the duplicates. For instance, in your case, where you want to drop the documents duplicated on the key duplicate_key, you can do
db.duplicate_collection.ensureIndex({'duplicate_key' : 1},{unique: true, dropDups: true})
where duplicate_collection is the collection containing your duplicate documents. This operation will preserve only a single document when several share a given key.
After the operation, if you want to remove the index, just run a dropIndex operation. For details, you can search the MongoDB documentation.
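For example, the corresponding shell command has the same shape as the ensureIndex call above:
db.duplicate_collection.dropIndex({'duplicate_key' : 1})
One caveat worth knowing: the dropDups option was removed in MongoDB 3.0, so this trick only works on older servers; on newer versions you have to delete the duplicates first (for example with the Ruby approach below) and then build the unique index.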

A lot of solutions suggest Map Reduce (which is fast and fine), but I implemented a solution in Ruby that seems pretty fast as well and makes it easy to keep exactly one document from each duplicate set.
Basically, you scan the whole collection once, recording every key in a hash; whenever you hit a key that is already in the hash, you add that document's id to an array, which you then use for a bulk removal at the end.
all_keys = {}
dupes = []
dupe_key = "some_key"

$mongodb.collection("some_collection").find.each do |doc|
  if all_keys.key?(doc[dupe_key])
    dupes << doc["_id"]            # key seen before: mark this doc for removal
  else
    all_keys[doc[dupe_key]] = 1    # first occurrence: the one we keep
  end
end
$mongodb.collection("some_collection").remove({_id: {"$in" => dupes } })
The only issue with this method is that it potentially won't work if the total list of keys/dupe ids can't be stored in memory. The map reduce solution would probably be best at that point.
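If the dupes array itself gets large, one mitigation (a sketch reusing the $mongodb wrapper from the question) is to issue the removal in fixed-size batches, which bounds the size of each $in query even though the all_keys hash still has to fit in memory:
dupes.each_slice(1000) do |batch|
  $mongodb.collection("some_collection").remove({_id: {"$in" => batch}})
end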

Related

Using redis to store a structured event log

I'm a bit new to Redis, so please forgive if this is basic.
I'm working on an app that sends automatic replies to users for certain events. I would like to use Redis to store who has received what event.
Essentially, in Ruby, the data structure could look like this, where you have a map of users to events and the dates each event was sent:
{
  "mary@example.com" => {
    "sent_comment_reply" => ["12/12/2014", "3/6/2015"],
    "added_post_reply"   => ["1/4/2006", "7/1/2016"]
  }
}
What is the best way to represent this in a Redis data structure so you can ask: did Mary get a sent_comment_reply? And if so, when was the latest?
In short, the question is: how (if possible) can you have a Hash structure that holds an array in Redis?
The rationale, as opposed to using a set or list with a compound key, is that hashes have O(1) lookup time, whereas lookups on lists (LRANGE) are O(s+n) and on sets (SMEMBERS) are O(n).
One way of structuring it in Redis, assuming you know the user's events up front and want the latest ones to be quick to look up:
A sorted set per user. Its members are event codes (sent_comment_reply, added_post_reply), each scored with the timestamp of its most recent occurrence, so the latest event ranks highest. You can use ZRANK (or ZSCORE) to answer the question:
Did Mary get a sent_comment_reply?
A hash per user as well; here the field is the event name (sent_comment_reply) and the value is its latest content (body, date, etc.), overwritten on each new occurrence. This answers the question:
and if so, when was the latest?
Note: sorted sets are really fast, and in this example the events themselves are the data.
With sorted sets you can add, remove, or update elements in a very fast way (in a time proportional to the logarithm of the number of elements). Since elements are taken in order and not ordered afterwards, you can also get ranges by score or by rank (position) in a very fast way. Accessing the middle of a sorted set is also very fast, so you can use Sorted Sets as a smart list of non repeating elements where you can quickly access everything you need: elements in order, fast existence test, fast access to elements in the middle!
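A rough sketch of both structures with the redis-rb gem (the key names events:<user> and latest:<user> are made up for illustration):
require 'redis'
redis = Redis.new

def record_event(redis, user, event, payload)
  redis.zadd("events:#{user}", Time.now.to_i, event)  # sorted set: score = latest timestamp
  redis.hset("latest:#{user}", event, payload)        # hash: latest content per event
end

record_event(redis, "mary@example.com", "sent_comment_reply", "body, date, etc.")

# Did Mary get a sent_comment_reply? A nil score means no.
ts = redis.zscore("events:mary@example.com", "sent_comment_reply")
puts ts ? "yes, latest at #{Time.at(ts)}" : "no"
Note that ZADD simply updates the score when the member already exists, which is exactly what keeps the latest timestamp on top.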
A possible approach for mapping an array onto a hash is the following (a rough Ruby sketch; note it is not atomic, so concurrent writers could clash on the same index):
def add_element(redis, key, value)
  len = redis.hlen(key)        # current field count = next array index
  redis.hset(key, len, value)
end
This maps array element i to field i of the hash stored at key. It will work for some cases, but I would probably go with the answer suggested in https://stackoverflow.com/a/34886801/2868839

Delete entire voltrb store collection

store.widgets.clear does not seem to save to the database.
So I tried:
store.widgets.each do |i|
  i.destroy
end
And this destroys only half of the records in the DB.
Any suggestions on how I can remove an entire store collection?
I need to add a .destroy_all (or make .clear do that), but haven't gotten around to it. The issue you're seeing with .each is that it's iterating through the array while deleting from it. For now you can do:
store.widgets.reverse.each(&:destroy)
I'll prioritize the destroy_all.
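The skipping is the same effect you get when deleting from a plain Ruby Array while iterating over it, shown here as a minimal illustration:
arr = [1, 2, 3, 4]
arr.each { |x| arr.delete(x) }  # each advances its index while delete shifts items left
arr                             # => [2, 4] -- half the elements survive
Iterating in reverse avoids this, because removing the current element never shifts the positions of the elements still to be visited.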

Condense nested for loop to improve processing time with text analysis

I am working on an untrained classifier model. I am working in Python 2.7. I have a loop. It looks like this:
features = [0 for i in xrange(len(dictionary))]
for bgrm in new_scored:
    for i in xrange(len(dictionary)):
        if bgrm[0] == dictionary[i]:
            features[i] = int(bgrm[1])
            break
I have a "dictionary" of bigrams that I have collected from a data set of customer reviews, and I would like to construct a feature array for each review corresponding to that dictionary. Each array would contain the frequencies of the dictionary bigrams found within the review (I hope that makes sense). new_scored is a list of tuples pairing the bigrams found within a particular review with their relative frequency of occurrence in that review. The final feature arrays will be the same length as the original dictionary, with few non-zero entries.
The above works fine, but I am looking at a data set of 13000 reviews; looping through this code for each review is going to take forever (if my computer doesn't run out of RAM first). I have been sitting with it for a while and cannot see how to condense it.
I am very new to Python, so I was hoping someone more experienced could help with condensing it, or perhaps point me in the right direction towards a library that contains the function I need.
Thank you in advance!
Consider making dictionary an actual dict object (or some fancier subclass of dict if it better suits your needs), as opposed to an iterable (a list or tuple, which is what it seems to be now). dictionary could then map each bigram key to an integer identifying its feature position.
If you refactor dictionary that way, then the loop can be rewritten as:
# built once, e.g.: dictionary = {bigram: i for i, bigram in enumerate(old_dictionary_list)}
features = [0 for key in dictionary]
for bgram in new_scored:
    try:
        features[dictionary[bgram[0]]] = int(bgram[1])
    except KeyError:
        pass  # the bigram is not in the dictionary for some reason; skip it
This should convert what was an O(n) scan of dictionary for every bigram into a single O(1) hash lookup.
Hope this helps.

Sorting by counting the intersection of two lists in MongoDB

We have a post-analysis requirement: for a specific post, we need to return a list of the posts most related to it, where relatedness is the count of tags the posts share. For example:
postA = {"author": "abc",
         "title": "blah blah",
         "tags": ["japan", "japanese style", "england"],
        }
there may be other posts with tags like:
postB:["japan", "england"]
postC:["japan"]
postD:["joke"]
So basically, postB scores 2 and postC scores 1 when compared against postA's tags; postD scores 0 and will not be included in the result.
My understanding for now is to use map/reduce to produce the result. I understand the basic usage of map/reduce, but I can't figure out a solution for this specific purpose.
Any help? Or is there a better way, like a custom sorting function, to work it out? I'm currently using pymongo, as I'm a Python developer.
You should create an index on tags:
db.posts.ensure_index([('tags', 1)])
and search for posts that share at least one tag with postA:
posts = list(db.posts.find({'_id': {'$ne': postA['_id']}, 'tags': {'$in': postA['tags']}}))
and finally, sort by intersection in Python:
key = lambda post: sum(1 for tag in post['tags'] if tag in postA['tags'])
posts.sort(key=key, reverse=True)
Note that if postA shares at least one tag with a large number of other posts this won't perform well, because you'll send so much data from Mongo to your application; unfortunately there's no way to sort and limit by the size of the intersection using Mongo itself.

data structure to support lookup based on full key or part of key

I need to be able to look up based on the full key or part of the key.
e.g. I might store keys like 10,20,30,40; 11,12,30,40; 12,20,30,40
I want to be able to search for 10,20,30,40 or 20,30,40.
What is the best data structure for achieving this, optimizing for time?
Our programming language is Java; any pointers to open source projects would be appreciated.
Thanks in advance.
If those were the actual numbers I'd be working with, I'd use an array where each index holds a list of all records that contain that value. If the actual numbers were larger, I'd use a hash table employed the same way.
So the structure would look like (empty indexes elided, in the case of the array implementation):
10 => ((10,20,30,40)),
11 => ((11,12,30,40)),
12 => ((11,12,30,40), (12,20,30,40)),
20 => ((10,20,30,40), (12,20,30,40)),
30 => ((10,20,30,40), (11,12,30,40), (12,20,30,40)),
40 => ((10,20,30,40), (11,12,30,40), (12,20,30,40)),
It's not clear to me whether your searches are inclusive (OR-based) or exclusive (AND-based), but either way you look up the record groups for each element of the search set; for the inclusive search you find their union, and for the exclusive search you find their intersection.
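A minimal sketch of that inverted index (in Ruby for brevity, though the idea ports directly to Java, e.g. a HashMap<Integer, List<int[]>>):
index = Hash.new { |h, k| h[k] = [] }
records = [[10, 20, 30, 40], [11, 12, 30, 40], [12, 20, 30, 40]]
records.each { |rec| rec.each { |elem| index[elem] << rec } }

groups = [20, 30, 40].map { |elem| index[elem] }
groups.reduce(:|)  # inclusive (OR): union of the record groups
groups.reduce(:&)  # exclusive (AND): intersection => [[10, 20, 30, 40], [12, 20, 30, 40]]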
Since you seem to care about retrieval time over other concerns (such as space), I suggest you use a hashtable and enter your items several times, once per subkey. So you'd put("10,20,30,40", mydata), then put("20,30,40", mydata), and so on (of course this would be a method; you're not going to manually call put so many times).
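A sketch of that multi-put (hypothetical names; the record is stored once per suffix of its key, trading space for O(1) lookups):
table = Hash.new { |h, k| h[k] = [] }

def put_suffixes(table, parts, record)
  parts.each_index { |i| table[parts[i..-1].join(",")] << record }
end

put_suffixes(table, %w[10 20 30 40], "record A")
table["20,30,40"]  # => ["record A"] -- found by full key or any suffix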
Use a tree structure. Here is an open source project that might help ... written in Java :-)
http://suggesttree.sourceforge.net/
