Ruby finding duplicates in MongoDB - ruby

I am struggling to get this working efficiently. I think map-reduce is the answer, but I can't get anything working. I know it is probably a simple answer; hopefully someone can help.
Entry Model looks like this:
field :var_name, type: String
field :var_data, type: String
field :var_date, type: DateTime
field :external_id, type: Integer
If the external data source malfunctions, we get duplicate data. One way to stop this was to check, while consuming the results, whether a record with the same external_id already exists, i.e. one we have already consumed. However, this slows the process down a lot. The plan now is to check for duplicates once a day, so we are looking to get a list of Entries with the same external_id, which we can then sort, deleting those no longer needed.
I have tried adapting the snippet from here https://coderwall.com/p/96dp8g/find-duplicate-documents-in-mongoid-with-map-reduce as shown below but get
failed with error 0: "exception: assertion src/mongo/db/commands/mr.cpp:480"
def find_duplicates
map = %Q{
function() {
emit(this.external_id, 1);
}
}
reduce = %Q{
function(key, values) {
return Array.sum(values);
}
}
Entry.all.map_reduce(map, reduce).out(inline: true).each do |entry|
puts entry["_id"] if entry["value"] != 1
end
end
Am I way off? Could anyone suggest a solution? I am using Mongoid, Rails 4.1.6 and Ruby 2.1.

I got it working using the suggestion in the comments of the question by Stennie using the Aggregation framework. It looks like this:
results = Entry.collection.aggregate([
{ "$group" => {
_id: { "external_id" => "$external_id"},
recordIds: {"$addToSet" => "$_id" },
count: { "$sum" => 1 }
}},
{ "$match" => {
count: { "$gt" => 1 }
}}
])
I then loop through the results and delete any unnecessary entries.
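The cleanup loop itself is not shown above. A minimal sketch of that step, assuming each result document carries the recordIds array built by the pipeline and that keeping the first id of each group is acceptable. ids_to_delete is a hypothetical helper and the sample data is made up:

```ruby
# Hypothetical helper: given the grouped aggregation results, return every
# _id except the first one in each duplicate group.
def ids_to_delete(results)
  results.flat_map do |group|
    ids = group["recordIds"]
    ids.drop(1) # keep the first id, mark the rest for deletion
  end
end

# Sample data shaped like the aggregation output (values are made up).
sample_results = [
  { "_id" => { "external_id" => 42 }, "recordIds" => ["a1", "a2", "a3"], "count" => 3 },
  { "_id" => { "external_id" => 7 },  "recordIds" => ["b1", "b2"],       "count" => 2 }
]

doomed = ids_to_delete(sample_results)
puts doomed.inspect
# With Mongoid, the ids could then be removed with something like:
#   Entry.where(:_id.in => doomed).delete_all
```

Note that $addToSet does not guarantee any order, so if it matters which record survives (for example the oldest), sort each group's ids before dropping the first.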

Related

Using event field as hash variable

I'm receiving events in Logstash containing a measurement, values and tags. I do not know ahead of time which fields and tags there are, so I wanted to do something like this:
input {
http {}
}
filter {
ruby {
code => '
tags = event.get("stats_tags").split(",")
samples = event.get("stats_samples").split(" ")
datapoints = {}
samples.each {|s|
splat = s.split(" ")
datapoints[splat[0]] = splat[1]
}
event.set("[@metadata][stats-send-as-tags]", tags)
event.set("[@metadata][stats-datapoints]", datapoints)
'
}
}
output {
influxdb {
host => "influxdb"
db => "events_db"
measurement => measurement
send_as_tags => [@metadata][stats-send-as-tags]
data_points => [@metadata][stats-datapoints]
}
}
But this produces an error. After much googling to no avail, I'm starting to think this is impossible.
Is there a way to pass a hash and an array from an event field to the output/filter configuration?
EDIT: If I double-quote it, the error I'm getting is
output {
influxdb {
# This setting must be a hash
# This field must contain an even number of items, got 1
data_points => "[@metadata][stats-datapoints]"
...
}
}

Ruby mongoid aggregation return object

I am doing a MongoDB aggregation with Mongoid, using ModelName.collection.aggregate(pipeline). The value returned is an array and not a Mongoid::Criteria, so if I call first on the array, I get the first element, which is of type BSON::Document instead of ModelName. As a result, I am unable to use it as a model.
Is there a method to return a criteria instead of an array from the aggregation, or convert a bson document to a model instance?
Using mongoid (4.0.0)
I've been struggling with this on my own too. I'm afraid you have to build your "models" on your own. Let's take an example from my code:
class Searcher
# ...
def results(page: 1, per_page: 50)
pipeline = []
pipeline << {
"$match" => {
title: /#{@params['query']}/i
}
}
geoNear = {
"near" => coordinates,
"distanceField" => "distance",
"distanceMultiplier" => 3959,
"num" => 500,
"spherical" => true,
}
pipeline << {
"$geoNear" => geoNear
}
count = aggregate(pipeline).count
pipeline << { "$skip" => ((page.to_i - 1) * per_page) }
pipeline << { "$limit" => per_page }
places_hash = aggregate(pipeline)
places = places_hash.map { |attrs| Offer.new(attrs) { |o| o.new_record = false } }
# ...
places
end
def aggregate(pipeline)
Offer.collection.aggregate(pipeline)
end
end
I've omitted a lot of code from the original project, just to show what I've been doing.
The most important thing here was the line:
places_hash.map { |attrs| Offer.new(attrs) { |o| o.new_record = false } }
Here I'm creating an array of Offers and additionally setting their new_record attribute to false by hand, so they behave like any other documents fetched with a simple Offer.where(...).
It's not beautiful, but it worked for me, and I could make the most of the whole Aggregation Framework!
Hope that helps!
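The paging part of the pipeline above can be looked at in isolation. A small sketch, assuming 1-based page numbers as in the results method; paging_stages is a hypothetical helper, not part of Mongoid:

```ruby
# Hypothetical helper: build the $skip/$limit stages for a 1-based page number.
def paging_stages(page, per_page)
  page = [page.to_i, 1].max # treat nil, 0 or negative pages as page 1
  [
    { "$skip"  => (page - 1) * per_page },
    { "$limit" => per_page }
  ]
end

puts paging_stages(3, 50).inspect
```

The returned stages could be appended to an existing pipeline with pipeline.concat(paging_stages(page, per_page)).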

Rails4 + Json API : Increase detail of response

I'm a newbie to RoR (and Ruby). I need a little help with a JSON response (with Grape).
This is the sample:
{
events: [
{
'some data':'some data',
place_id: 1
}
]
}
Now this is the result of Events.all in Rails, but for each event I also want to query the place, to get more data instead of only the id. I'm sure a lambda could help me, but for now I have no idea how to do it. I've been trying without success...
Thanks in advance
UPDATE
Desired result
{
events: [
{
'some data':'some data',
place : {
id: 1,
name: 'Blablabla'
}
}
]
}
Consider using ActiveModelSerializers which allows you to define how your models should be serialized in a manner similar to ActiveRecord DSL (e.g. your problem would be solved by defining that event has_one :place)
:events => events.as_json(include: :place)
This is what solved my problem (after adding belongs_to to the model, obviously).
from http://api.rubyonrails.org/classes/ActiveModel/Serializers/JSON.html
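To make the effect of include: :place concrete, here is a plain-Ruby sketch of the transformation on hashes. It does not use Rails; embed_places and the sample records are made up for illustration and only mimic the shape of the desired result:

```ruby
require "json"

# Made-up sample records standing in for Place and Event rows.
places = { 1 => { id: 1, name: "Blablabla" } }
events = [{ title: "some data", place_id: 1 }]

# Hypothetical helper: replace each event's place_id with the full place hash.
def embed_places(events, places)
  events.map do |event|
    rest = event.reject { |key, _| key == :place_id }
    rest.merge(place: places[event[:place_id]])
  end
end

puts JSON.pretty_generate({ events: embed_places(events, places) })
```

In a real Rails app the association does this lookup for you, which is why the one-line as_json call is enough there.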

tire terms filter not working

I'm trying to achieve a "scope-like" function with tire/elasticsearch. Why is this not working, even when I have entries with status "Test1" or "Test2"? The results are always empty.
collection = @model.search(:page => page, :per_page => per_page) do |s|
s.query {all}
s.filter :terms, :status => ["Test1", "Test2"]
s.sort {by :"#{column}", "#{direction}"}
end
The method works fine without the filter. Is something wrong with the filter method? I've checked the tire docs, and it should work.
Thanks! :)
Your issue is most probably being caused by using the default mappings for the status field, which would tokenize it -- downcase, split into words, etc.
Compare these two:
http://localhost:9200/myindex/_analyze?text=Text1&analyzer=standard
http://localhost:9200/myindex/_analyze?text=Text1&analyzer=keyword
The solution in your case is to use the keyword analyzer (or set the field to not_analyzed) in your mapping. If the field were not an “enum” type of data, you could use the multi-field feature instead.
A working Ruby version would look like this:
require 'tire'
Tire.index('myindex') do
delete
create mappings: {
document: {
properties: {
status: { type: 'string', analyzer: 'keyword' }
}
}
}
store status: 'Test1'
store status: 'Test2'
refresh
end
search = Tire.search 'myindex' do
query do
filtered do
query { all }
filter :terms, status: ['Test1']
end
end
end
puts search.results.to_a.inspect
Note: It's rarely possible -- this case being an exception -- to offer reasonable advice when no index mappings, example data, etc. are provided.
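Why the original filter comes back empty can be imitated in plain Ruby. This is only a rough stand-in for the two analyzers (real Elasticsearch analysis does more), and standard_tokens, keyword_tokens and terms_match? are hypothetical names:

```ruby
# Rough stand-ins for the two analyzers; real Elasticsearch analysis does more.
def standard_tokens(text)
  text.downcase.scan(/[[:alnum:]]+/) # lowercases and splits into words
end

def keyword_tokens(text)
  [text] # keeps the whole value as a single, unchanged token
end

# A terms filter compares the given terms to the indexed tokens verbatim.
def terms_match?(indexed_tokens, terms)
  (indexed_tokens & terms).any?
end

puts terms_match?(standard_tokens("Test1"), ["Test1", "Test2"]) # false
puts terms_match?(keyword_tokens("Test1"),  ["Test1", "Test2"]) # true
```

With the default mapping the index holds "test1", so filtering on "Test1" finds nothing; with the keyword analyzer the stored token is "Test1" and the filter matches.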

Mongoid Complex Query Including Embedded Docs

I have a model with several embedded models. I need to query for a record to see if it exists. The issue is that I have to reference multiple embedded documents; my query would have to include the following params:
{
"first_name"=>"Steve",
"last_name"=>"Grove",
"email_addresses"=>[
{"type"=>"other", "value"=>"steve@stevegrove.com", "primary"=>"true"}
],
"phone_numbers"=>[
{"type"=>"work_fax", "value"=>"(720) 555-0631"},
{"type"=>"home", "value"=>"(303) 555-1978"}
],
"addresses"=>[
{"type"=>"work", "street_address"=>"6390 N Main Street", "city"=>"Elbert", "state"=>"CO"}
],
}
How can I query for all the embedded docs even though some fields are missing such as _id and associations?
A few things to think about.
Are you sure the query HAS to contain all these parameters? Is there not a subset of this information that uniquely identifies the record? Say (first_name, last_name, and an email_addresses.value). It would be silly to query all the conditions if you could accomplish the same thing in less work.
In Mongoid, the where criteria lets you use raw JavaScript, so if you know how to write the criteria in JavaScript you can just pass it to where as a string.
Otherwise you're left writing a really awkward where criteria statement. Thankfully, you can use dot notation.
Something like:
UserProfile.where(first_name: "Steve",
last_name: "Grove",
:email_addresses.matches => {type: "other",
value: "steve@stevegrove.com",
primary: "true"},
..., ...)
in response to the request for embedded js:
query = %{
function () {
var email_match = false;
for(var i = 0; i < this.email_addresses.length && !email_match; i++){
email_match = this.email_addresses[i].value === "steve@stevegrove.com";
}
return this.first_name === "Steve" &&
this.last_name === "Grove" &&
email_match;
}
}
UserProfile.where(query).first
It's not pretty, but it works
With Mongoid 3 you could use elem_match http://mongoid.org/en/origin/docs/selection.html#symbol
UserProfile.where(:email_addresses.elem_match => {value: 'steve@stevegrove.com', primary: true})
This assumes
class UserProfile
include Mongoid::Document
embeds_many :email_addresses
end
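What elem_match asks the database to check can be imitated in plain Ruby. This is only an illustration of the semantics (at least one embedded document must satisfy all conditions at once); elem_match? is a hypothetical helper and the addresses are made up:

```ruby
# Plain-Ruby imitation of $elemMatch: true when at least one element of the
# array satisfies every condition at once.
def elem_match?(embedded_docs, conditions)
  embedded_docs.any? do |doc|
    conditions.all? { |field, expected| doc[field] == expected }
  end
end

emails = [
  { "type" => "work",  "value" => "steve@example.com", "primary" => false },
  { "type" => "other", "value" => "grove@example.com", "primary" => true }
]

puts elem_match?(emails, { "value" => "grove@example.com", "primary" => true }) # true
puts elem_match?(emails, { "value" => "steve@example.com", "primary" => true }) # false
```

The second call is false because no single address is both steve@example.com and primary, even though each condition is met by some element on its own.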
Now if you needed to include every one of these fields, I would recommend using the UserProfile.collection.aggregate(query). In this case you could build a giant hash with all the fields.
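# A raw pipeline needs the $elemMatch operator; :field.elem_match only works in Mongoid criteria.
query = { '$match' => {
'$or' => [
{'email_addresses' => {'$elemMatch' => {value: 'steve@stevegrove.com', primary: true}}}
]
} }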
It starts to get a little crazy, but hopefully that gives you some insight into what your options are. See https://coderwall.com/p/dtvvha for another example.

Resources