Evaluating Ruby and Mongo performance: 1.6M small records in file, takes 15 min to write 800K records to Mongo? More efficient way? - ruby

We have 1.6M records in a flat file. Each record contains three or four short strings of fewer than 100 characters.
We only need 800K of these records. We write these records to a Mongo collection. The other 800K are ignored.
It takes about 15 min to process the file, meaning we process about 1.67K records/second. Is this expected performance, or should the process be much faster (e.g., 5K records/second, 10K records/second)?
Code below (#skip is a hash of about 800K app IDs).
def updateApplicationDeviceTypes(dir, limit)
puts "Updating Application Data (Pass 3 - Device Types)..."
file = File.join(dir, '/application_device_type')
cols = getColumns(file)
device_type_id_col = cols[:device_type_id]
update = Proc.new do |id, group|
#applications_coll.update(
{ "itunes_id" => id },
{ :$set => { "devices" => group } }
# If all records for one id aren't adjacent, you'll need this instead
#{ :$addToSet => { "devices" => { :$each => group } } }
) unless !id or #skip[id.intern]
end
getValue = Proc.new { |r| r[device_type_id_col] }
batchRecords(file, cols[:application_id], update, getValue, limit)
end
# result to an array, before calling "update" on the array/id
def batchRecords(filename, idCol, update, getValue, limit=nil)
current_id = nil
current_group = []
eachRecord(filename, limit) do |r|
id = r[idCol]
value = getValue.call(r)
if id == current_id and !value.nil?
current_group << value
else
update.call(current_id, current_group) unless current_id.nil?
current_id = id
current_group = value.nil? ? [] : [value]
end
end
# Since the above is only called once for each row, we still
# have one group to update.
update.call(current_id, current_group)
end

Schema design and the read and write patterns of your application play an extremely large role in your applications performance.
I'd recommend enabling the profiler before looking at machine and IO level performance:
http://docs.mongodb.org/manual/tutorial/manage-the-database-profiler/
You may also find these talks from last year's MongoSV useful:
http://www.10gen.com/presentations/mongosv-2012/lessons-field-performance-operations
http://www.10gen.com/presentations/mongosv-2012/mongodb-performance-tuning

Related

How to use scopes to filter data by partial variable

I have some working script filtering my results with Active Record Scoping. Everything works fine when i want to filter by comparing params with data from database.
But i have some function counting car price inside _car.html.erb partial, the result of this function depends on params result.
How can i scope search results by result of this function and show only cars which are under some price (defined in params).
Some code to make it more clear:
car.rb (model file)
scope :price_leasing, -> (price_leasing) { where('price_leasing <= ?', price_leasing) }
# for now price_leasing is getting price from database
scope :brand, -> (brand) { where brand: brand }
scope :car_model, -> (car_model) { where car_model: car_model }
scope :category, -> (category) { where category: category }
cars_controller.rb (controller file)
def index
#cars = Car.where(nil)
#cars = #cars.price_leasing(params[:price_leasing]) if params[:price_leasing].present?
#cars = #cars.brand(params[:brand]) if params[:brand].present?
#cars = #cars.car_model(params[:car_model]) if params[:car_model].present?
#cars = #cars.category(params[:category]) if params[:category].present?
#brands = Brand.all # importing all car brands into filters
end
in index.html.erb i have "render #cars" code
<%=
if #cars.size > 0
render #cars.where(:offer_status => 1)
else
render html: '<p>Nie znaleziono pasujących wyników.</p>'.html_safe
end
%>
inside _car.html.erb file i have function from helper
<h3 class="car-cell__price"><%= calculate_car_price(car.price, car.id) %> <span class="car-cell__light-text">zł/mc</span></h3>
my calculate_car_price() function inside helper
def calculate_car_price(car_price, car_id)
car = Car.find(car_id)
fullprice = car_price
if params[:price_leasing].present?
owncontribution = params[:price_leasing].to_i
else
owncontribution = car.owncontribution
end
pv = fullprice - owncontribution + (0.02 * fullprice)
if params[:period].present?
carperiod = params[:period].to_i
carprice = (Exonio.pmt(0.0522/60, carperiod, pv)) * -1
else
carprice = (Exonio.pmt(0.0522/60, 60, pv)) * -1
end
p number_with_precision(carprice, precision: 0)
end
i would love to scope by the result of this function. Is it possible?
The thing about scopes are that they are implemented at the database level. You want to select records that have a value
To do what you want at the DB level, you would need to a virtual column in the database, the implementation will change based on which database product you're using (I would not expect a virtual column definition in postgreSQL to be the same as a virtual column definition in mySQL)
So I'm not sure there's an optimal way to do this.
I would suggest you build your own class method and instance method in your model Car. It would be less performant but easier to understand and implement.
def self.car_price_under(target_price, params)
select { |v| v.car_price_under?(target_price, params) }
end
def car_price_under?(target_price, params)
full_price = price
if params[:price_leasing].present?
my_own_contribution = params[:price_leasing].to_i
else
my_own_contribution = owncontribution
end
pv = full_price - my_own_contribution + (0.02 * full_price)
if params[:period].present?
car_period = params[:period].to_i
new_car_price = (Exonio.pmt(0.0522/60, car_period, pv)) * -1
else
new_car_price = (Exonio.pmt(0.0522/60, 60, pv)) * -1
end
new_car_price <= target_price
end
This would let you do...
#cars = Car.where(nil)
#cars = #cars.brand(params[:brand]) if params[:brand].present?
#cars = #cars.car_model(params[:car_model]) if params[:car_model].present?
#cars = #cars.category(params[:category]) if params[:category].present?
#cars = #cars.car_price_under(target_price, params) if target_price.present?
Note that this is happening in rails, NOT in the database, so the car_price_under method should be called after all other scopes to minimise the number of records that need to be examined. You cannot chain additional scopes as #cars would be an array... if you want to be able to chain additional scopes (or you want an active record relation, not an array) you could do something like #cars = Car.where(id: #cars.pluck(:id))

How to Sort List of Objects Based on a Property Value

I am using a CriteraBuilder to return a list of objects. I want to sort this list (hopefully in the query) by a property value in the object if it equals status "PENDING". The statuses on the object can be "Valid, Expired, or Pending". The objects with status "Pending" I want to place first in the returned list. Note I want to be able to paginate this list.
CRITERIA
def getAllIds(Map opts = [:]) {
def max = opts.max ?: 10
def offset = opts.offset ?: 0
def c = Identification.createCriteria()
List<Identification> ids = c.list(max: max, offset: offset) {
//sort here if status == "PENDING"
}
return ids
}
You can use 'withCriteria' instead of 'createCrkteria'+'list' and use 'order' into it
See http://docs.grails.org/3.1.1/ref/Domain%20Classes/withCriteria.html

MongoDB + Ruby: updating records in an iteration

Using MongoDB and the Ruby driver, I'm trying to calculate the rankings for players in my app, so I'm sorting by (in this case) pushups, and then adding a rank field and value per object.
pushups = coll.find.sort(["pushups", -1] )
pushups.each_with_index do |r, idx|
r[:pushups_rank] = idx + 1
coll.update( {:id => r }, r, :upsert => true)
coll.save(r)
end
This approach does work, but is this the best way to iterate over objects and update each one? Is there a better way to calculate a player's rank?
Another approach would be to do the entire update on the server by executing a javascript function:
update_rank = "function(){
var rank=0;
db.players.find().sort({pushups:-1}).forEach(function(p){
rank +=1;
p.rank = rank;
db.players.save(p);
});
}"
cn.eval( update_rank )
(Code assumes you have a "players" collection in mongo, and a ruby variable cn that holds a conection to your database)

increment value in a hash

I have a bunch of posts which have category tags in them.
I am trying to find out how many times each category has been used.
I'm using rails with mongodb, BUT I don't think I need to be getting the occurrence of categories from the db, so the mongo part shouldn't matter.
This is what I have so far
#recent_posts = current_user.recent_posts #returns the 10 most recent posts
#categories_hash = {'tech' => 0, 'world' => 0, 'entertainment' => 0, 'sports' => 0}
#recent_posts do |cat|
cat.categories.each do |addCat|
#categories_hash.increment(addCat) #obviously this is where I'm having problems
end
end
end
the structure of the post is
{"_id" : ObjectId("idnumber"), "created_at" : "Tue Aug 03...", "categories" :["world", "sports"], "message" : "the text of the post", "poster_id" : ObjectId("idOfUserPoster"), "voters" : []}
I'm open to suggestions on how else to get the count of categories, but I will want to get the count of voters eventually, so it seems to me the best way is to increment the categories_hash, and then add the voters.length, but one thing at a time, i'm just trying to figure out how to increment values in the hash.
If you aren't familiar with map/reduce and you don't care about scaling up, this is not as elegant as map/reduce, but should be sufficient for small sites:
#categories_hash = Hash.new(0)
current_user.recent_posts.each do |post|
post.categories.each do |category|
#categories_hash[category] += 1
end
end
If you're using mongodb, an elegant way to aggregate tag usage would be, to use a map/reduce operation. Mongodb supports map/reduce operations using JavaScript code. Map/reduce runs on the db server(s), i.e. your application does not have to retrieve and analyze every document (which wouldn't scale well for large collections).
As an example, here are the map and reduce functions I use in my blog on the articles collection to aggregate the usage of tags (which is used to build the tag cloud in the sidebar). Documents in the articles collection have a key named 'tags' which holds an array of strings (the tags)
The map function simply emits 1 on every used tag to count it:
function () {
if (this.tags) {
this.tags.forEach(function (tag) {
emit(tag, 1);
});
}
}
The reduce function sums up the counts:
function (key, values) {
var total = 0;
values.forEach(function (v) {
total += v;
});
return total;
}
As a result, the database returns a hash that has a key for every tag and its usage count as a value. E.g.:
{ 'rails' => 5, 'ruby' => 12, 'linux' => 3 }

Lua - Sorting a table alphabetically

I have a table that is filled with random content that a user enters. I want my users to be able to rapidly search through this table, and one way of facilitating their search is by sorting the table alphabetically. Originally, the table looked something like this:
myTable = {
Zebra = "black and white",
Apple = "I love them!",
Coin = "25cents"
}
I was able to implement a pairsByKeys() function which allowed me to output the tables contents in alphabetical order, but not to store them that way. Because of the way the searching is setup, the table itself needs to be in alphabetical order.
function pairsByKeys (t, f)
local a = {}
for n in pairs(t) do
table.insert(a, n)
end
table.sort(a, f)
local i = 0 -- iterator variable
local iter = function () -- iterator function
i = i + 1
if a[i] == nil then
return nil
else
return a[i], t[a[i]]
end
end
return iter
end
After a time I came to understand (perhaps incorrectly - you tell me) that non-numerically indexed tables cannot be sorted alphabetically. So then I started thinking of ways around that - one way I thought of is sorting the table and then putting each value into a numerically indexed array, something like below:
myTable = {
[1] = { Apple = "I love them!" },
[2] = { Coin = "25cents" },
[3] = { Zebra = "black and white" },
}
In principle, I feel this should work, but for some reason I am having difficulty with it. My table does not appear to be sorting. Here is the function I use, with the above function, to sort the table:
SortFunc = function ()
local newtbl = {}
local t = {}
for title,value in pairsByKeys(myTable) do
newtbl[title] = value
tinsert(t,newtbl[title])
end
myTable = t
end
myTable still does not end up being sorted. Why?
Lua's table can be hybrid. For numerical keys, starting at 1, it uses a vector and for other keys it uses a hash.
For example, {1="foo", 2="bar", 4="hey", my="name"}
1 & 2, will be placed in a vector, 4 & my will be placed in a hashtable. 4 broke the sequence and that's the reason for including it into the hashtable.
For information on how to sort Lua's table take a look here: 19.3 - Sort
Your new table needs consecutive integer keys and needs values themselves to be tables. So you want something on this order:
SortFunc = function (myTable)
local t = {}
for title,value in pairsByKeys(myTable) do
table.insert(t, { title = title, value = value })
end
myTable = t
return myTable
end
This assumes that pairsByKeys does what I think it does...

Resources