I have a class Call that represents a single phone call, with a certain number of minutes/seconds, the date of the call, etc. I want to sum the lengths of all calls for a given day.
The problem is that my data is in string format; I've been formatting it with various Time.parse options and many other things.
But my main problem is: how do I sum the durations? I need something like Ruby's inject/reduce, but smart enough to know that 60 seconds make one minute.
One additional wrinkle is that I'm reading from a .CSV file, turning every row into a Hash, and making Call objects out of it.
Any hints? :)
I suggest storing the duration of a call as a number of seconds in an integer column, because that would allow you to easily run calculations in the database.
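For instance, with a hypothetical integer column duration_seconds and a date column on the calls table (both names assumed, not from your schema), the per-day total becomes a single query:
# let the database do the summing; duration_seconds and date are assumed column names
total_seconds = Call.where(date: Date.today).sum(:duration_seconds)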
But if you prefer to keep the string representation you might want to use something like this:
# assuming `calls` is an array of Call instances and the
# duration of the call is stored in a "MM:SS" string attribute `duration`
total = calls.sum do |call|
  minutes, seconds = call.duration.split(':').map(&:to_i)
  minutes * 60 + seconds
end
# format output (zero-pad the seconds so 61 becomes "1:01", not "1:1")
format('%d:%02d', total / 60, total % 60)
Please note that calling sum with a block is part of ActiveSupport (plain Ruby only added Enumerable#sum in 2.4). When you are using pure Ruby without Rails, you can use inject instead:
total = calls.inject(0) do |sum, call|
  minutes, seconds = call.duration.split(':').map(&:to_i)
  sum + minutes * 60 + seconds
end
You could map them all to seconds via Time.parse, then inject over the mapped array.
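A minimal sketch of that approach, with hypothetical duration strings (note the caveat in the comments):
require 'time'
# hypothetical "MM:SS" duration strings
durations = ['03:25', '00:45', '12:10']
# Time.parse reads "03:25" as 03:25 AM, i.e. hours:minutes, so for
# "MM:SS" durations the parsed hour field holds the minutes and the
# minute field holds the seconds
total = durations.map { |s| t = Time.parse(s); t.hour * 60 + t.min }.inject(0, :+)
total # => 980 seconds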
I am trying to reduce the size of my data and I cannot make it work. I have data points taken every minute over one month, and I want to reduce this data to one sample per hour. The problem is that some of my rows have an "NA" value, so I delete those rows; as a result there are not exactly 60 points for every hour - it varies.
I have a Timestamp column, which I have used to make a datehour column that has the same value for rows sharing the same date and hour. I want to average all the values with the same datehour value.
How can I do this? I have tried using the if and for loops below, but it takes very long to run.
Thanks for all your help! I am new to Julia and come from a Matlab background.
uniquedatehour = unique(datehour, 1)
index = []
avedata = reshape([], 0, length(alldata[1, :]))
for j in uniquedatehour
    for i in 1:length(datehour)
        if datehour[i] == j
            index = vcat(index, i)
        else
            rows = alldata[index, :]
            rows = convert(Array{Float64,2}, rows)
            avehour = mean(rows, 1)
            avedata = vcat(avedata, avehour)
            index = []
            continue
        end
    end
end
There are several layers to optimizing this code. I am assuming that your data is sorted on datehour (your code assumes this).
Layer one: general recommendation
Wrap your code in a function. Executing code in global scope in Julia is much slower than within a function. When wrapping it, make sure to either pass the data to your function as arguments or, if the data stays in global scope, qualify it with const;
Layer two: recommendations for your algorithm
A statement like [] creates an array of type Any, which is slow; use a type qualifier like index = Int[] to make it fast;
Using vcat like index = vcat(index, i) is inefficient; it is better to do push!(index, i) in place;
It is better to preallocate avedata with e.g. fill(NA, length(uniquedatehour), size(alldata, 2)) and assign values into the existing matrix than to vcat onto it;
If I am not mistaken, your code will produce incorrect results, as it never processes the last entry of the uniquedatehour vector (assume it has only one element and check what happens - avedata will have zero rows);
The line rows = convert(Array{Float64,2}, rows) is probably not needed at all. If alldata is not a Matrix{Float64}, it is better to convert it once at the beginning with Matrix{Float64}(alldata);
You can change the line rows = alldata[index,:] to a view like view(alldata, index, :) to avoid an allocation;
In general you can avoid creating the index vector entirely: it is enough to remember the start s and end e positions of the range of equal values, and then use the range s:e to select the rows you want.
If you correct those things, please post your updated code and maybe I can help further; there is still room for improvement, but it requires a somewhat different algorithmic approach (then again, you may prefer the option below for simplicity).
Layer three: how I would do it
I would use DataFrames package to handle this problem like this:
using DataFrames
df = DataFrame(alldata) # assuming alldata is Matrix{Float64}, otherwise convert it here
df[:grouping] = datehour
agg = aggregate(df, :grouping, mean) # maybe this is all what you need if DataFrame is OK for you
Matrix(agg[2:end]) # here is how you can convert DataFrame back to a matrix
This is not the fastest solution (it converts to a DataFrame and back), but it is much simpler for me.
I currently have two time values in two separate expressions in SSRS, and I would like to subtract one from the other to give me a subtotal time.
At present, value 1 is 163:02:38 and its expression is as follows:
=System.Math.Floor(Sum(Fields!Staffed_Time.Value) / 3600) & ":" & Microsoft.VisualBasic.Strings.Format(Microsoft.VisualBasic.DateAndTime.DateAdd("s", Sum(Fields!Staffed_Time.Value), "00:00"), "mm:ss")
While value 2 is 5:12:46:
=System.Math.Floor(Sum(Fields!Time_in_Default.Value) / 3600) & ":" & Microsoft.VisualBasic.Strings.Format(Microsoft.VisualBasic.DateAndTime.DateAdd("s", Sum(Fields!Time_in_Default.Value), "00:00"), "mm:ss")
This means the subtotal I want is 157:49:52.
Now, when I use this expression:
=(System.Math.Floor(Sum(Fields!Staffed_Time.Value) / 3600) - System.Math.Floor(Sum(Fields!Time_in_Default.Value) / 3600)) & ":" & Microsoft.VisualBasic.Strings.Format(Microsoft.VisualBasic.DateAndTime.DateAdd("s", Sum(Fields!Staffed_Time.Value), "00:00"), "mm:ss")
It only subtracts the hour values, which in this case removes 5 hours, leaving me with a subtotal of 158:02:38.
So how can I get the expression to also subtract the minutes and seconds, to produce the desired subtotal?
A better solution than building a large expression is to add the following custom code to your report:
Public Function ConvertSecondsToTime(seconds As Integer) As String
    Dim ts As TimeSpan = TimeSpan.FromSeconds(seconds)
    ' Pad minutes and seconds to two digits; Floor must be qualified as Math.Floor
    Return Math.Floor(ts.TotalHours).ToString() + ":" + ts.Minutes.ToString("00") + ":" + ts.Seconds.ToString("00")
End Function
And use the custom code in an expression like so:
=code.ConvertSecondsToTime(Sum(Fields!Staffed_Time.Value) - Sum(Fields!Time_in_Default.Value))
Your two fields are stored in seconds, so rather than calculating and subtracting each unit of time (hours, minutes and seconds) separately and then applying the custom format, subtract in seconds (which in your example of 157:49:52 would be 568,192 seconds) and then apply the custom format.
The TimeSpan class will take the seconds and convert them into units of time when you call TimeSpan.FromSeconds.
You may wonder why we use TotalHours instead of Hours. Keep in mind that the Hours property is a component on a 24-hour clock, so anything beyond 24 hours rolls over into Days. In comparison, TotalHours,
as stated in the TimeSpan MSDN documentation, represents:
the value of the current TimeSpan structure expressed in whole and fractional hours.
In other words, it represents the whole time span in decimal hours. This is comparable to how your current expression calculates the hours, which is why we use Math.Floor(ts.TotalHours) in the custom code.
Assuming you have access to the database, you are far better off doing any complex data handling on the database side:
It's faster to develop and easier to debug.
Assuming you present the data via a stored procedure or view, it's faster to run, as it's compiled.
It's easier to manage changes to the schema in the future.
The view or stored procedure (i.e. your calculation) can be reused by others.
So however you are creating your dataset, I would add a further column called e.g. TimeDifference and pass this into SSRS.
The math in your third code sample is flawed in two ways. First, you are only subtracting the hours and completely disregarding the minutes and seconds. Second, you are flooring the hours before doing the subtraction, which could cause off-by-one issues in certain circumstances.
To solve your particular circumstance, you could fix your formula like so (the original had an unbalanced opening parenthesis):
=System.Math.Floor((Sum(Fields!Staffed_Time.Value) - Sum(Fields!Time_in_Default.Value)) / 3600) & ":" & Microsoft.VisualBasic.Strings.Format(Microsoft.VisualBasic.DateAndTime.DateAdd("s", Sum(Fields!Staffed_Time.Value) - Sum(Fields!Time_in_Default.Value), "00:00"), "mm:ss")
Breaking this down into components:
Subtract Time_in_Default from Staffed_Time, then determine the hour count.
Subtract Time_in_Default from Staffed_Time, then display the remaining minutes and seconds.
But to make this even simpler, just use Visual Basic's Strings.Format for the whole piece and avoid the math altogether:
=Microsoft.VisualBasic.Strings.Format(Microsoft.VisualBasic.DateAndTime.DateAdd("s", Sum(Fields!Staffed_Time.Value) - Sum(Fields!Time_in_Default.Value), "00:00"), "HH:mm:ss")
That won't require any custom code and is reasonably easy to understand. (One caveat: HH is a time-of-day component, so this simpler variant only displays correctly while the difference stays under 24 hours; for totals like 157:49:52 you still need the hour math or the custom code above.)
I have a function below that generates a set of combinations of an array at various lengths, defined by a range. I'd like to collect data about the combination process, including the time required to generate the combinations. Given the following:
source = ("a".."z").to_a
range = 1..7
The command to generate the combinations is this:
combinations = range.flat_map do |size|
  source.combination(size).to_a
end
This command takes about 5 seconds to run on my machine, and generates 971,711 combinations. However, when I try to execute this in the context of a function, below:
def combinations(source, range)
  time_start = Time.now
  combinations = range.flat_map do |size|
    source.combination(size).to_a
  end
  time_elapsed = (Time.now - time_start).round(1)
  puts "Generated #{combinations.count} in #{time_elapsed} seconds."
  combinations
end
source = ("a".."z").to_a
range = 1..7
combinations(source, range)
The function almost immediately outputs:
Generated 971711 in 0.1 seconds.
... and then, 5 seconds later, returns the combinations. What's going on here? And how can I measure the time required to generate the combinations?
When I run your code on ruby 2.0.0p247 on an Ubuntu 12.04 32-bit machine, I get the output:
Generated 971711 in 0.6 seconds.
and the program exits immediately after that.
Since there is only one puts line in the program, what do you mean by "and then 5 seconds later returns the combinations"? Is there more code that you are not showing us? Which Ruby interpreter are you running, and on what operating system? Could you provide the full code if you haven't already?
If you want to look into this more, I recommend trying rblineprof or ruby-prof.
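For instance, a minimal ruby-prof session over the combination step might look like this (a sketch; the flat printer prints a per-method time breakdown):
require 'ruby-prof'
# profile just the combination generation
result = RubyProf.profile do
  (1..7).flat_map { |size| ('a'..'z').to_a.combination(size).to_a }
end
# print a flat, per-method breakdown to stdout
RubyProf::FlatPrinter.new(result).print(STDOUT)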
So it looks like the issue is that Ruby was taking the ~5 seconds to render the returned array in IRB. The "Generated X in Y seconds." measurement was actually correct; it was just smaller than I expected, because I had confused the time required to calculate the combinations with the time required for IRB to display the output.
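One way to confirm this in IRB is to suppress the echo of the large return value, so the prompt comes back immediately:
# appending "; nil" makes IRB print nil instead of rendering the
# ~970k-element result array
combinations(source, range); nil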
User.find(:all, :order => "RANDOM()", :limit => 10) was the way I did it in Rails 3.
User.all(:order => "RANDOM()", :limit => 10) is how I thought Rails 4 would do it, but this is still giving me a Deprecation warning:
DEPRECATION WARNING: Relation#all is deprecated. If you want to eager-load a relation, you can call #load (e.g. `Post.where(published: true).load`). If you want to get an array of records from a relation, you can call #to_a (e.g. `Post.where(published: true).to_a`).
You'll want to use the order and limit methods instead. You can get rid of the all.
For PostgreSQL and SQLite:
User.order("RANDOM()").limit(10)
Or for MySQL:
User.order("RAND()").limit(10)
As the random function differs between databases, I would recommend using the following code instead:
User.offset(rand(User.count)).first
Of course, this is useful only if you're looking for a single record.
If you want to get more than one, you could do something like:
User.offset(rand(User.count) - 10).limit(10)
The - 10 is to ensure you get 10 records even when rand returns a number greater than count - 10.
Keep in mind, though, that you'll always get 10 consecutive records.
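Note also that rand(User.count) - 10 itself goes negative when rand returns a value below 10; a clamped variant (a sketch) avoids passing a negative offset:
# clamp the starting offset at zero so small rand values don't break the query
offset = [rand(User.count) - 10, 0].max
User.offset(offset).limit(10)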
I think the best solution is really to order randomly in the database.
But if you need to avoid the database's specific random function, you can use the pluck-and-shuffle approach.
For one record:
User.find(User.pluck(:id).shuffle.first)
For more than one record:
User.where(id: User.pluck(:id).sample(10))
I would suggest making this a scope as you can then chain it:
class User < ActiveRecord::Base
scope :random, -> { order(Arel::Nodes::NamedFunction.new('RANDOM', [])) }
end
User.random.limit(10)
User.active.random.limit(10)
While not the fastest solution, I like the brevity of:
User.ids.sample(10)
The .ids method returns an array of User IDs, and .sample(10) picks 10 random values from that array.
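If you need the actual records rather than just their IDs, you can feed the sample back into a query (one extra round trip):
# load the 10 sampled users as ActiveRecord objects
User.where(id: User.ids.sample(10))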
I strongly recommend this gem for random records; it is specially designed for tables with lots of rows:
https://github.com/haopingfan/quick_random_records
All the other answers perform badly with a large database, except this gem:
quick_random_records took only 4.6 ms in total.
the accepted answer User.order('RAND()').limit(10) took 733.0 ms.
the offset approach took 245.4 ms in total.
the User.all.sample(10) approach took 573.4 ms.
Note: my table only has 120,000 users. The more records you have, the bigger the performance difference will be.
UPDATE:
Performance on a table with 550,000 rows:
Model.where(id: Model.pluck(:id).sample(10)) took 1384.0 ms
gem: quick_random_records took only 6.4 ms in total
For MySQL this worked for me:
User.order("RAND()").limit(10)
You could call .sample on the records, like: User.all.sample(10) (note that this loads every record into memory first).
The answer from @maurimiranda, User.offset(rand(User.count)).first, is not good when we need 10 random records, because User.offset(rand(User.count) - 10).limit(10) returns a sequence of 10 records starting at a random position - they are not truly random. We would have to call that function 10 times to get 10 truly random records.
Besides that, offset is also bad when the random function returns a high value. If your query looks like offset: 10000, limit: 20, it generates 10,020 rows and throws away the first 10,000 of them,
which is very expensive. So calling offset/limit 10 times is not efficient.
So I think that if we just want one random user, User.offset(rand(User.count)).first may be better (at least we can improve it by caching User.count).
But if we want 10 or more random users, User.order("RAND()").limit(10) should be better.
Here's a quick solution... I'm currently using it with over 1.5 million records and getting decent performance. The best solution would be to cache one or more random record sets, and then refresh them with a background worker at a desired interval.
Create a random_records_helper.rb file:
module RandomRecordsHelper
  # NOTE: this assumes ids are contiguous from 1..count with no deleted
  # rows, and it may pick the same id more than once
  def random_user_ids(n)
    user_ids = []
    user_count = User.count
    n.times { user_ids << rand(1..user_count) }
    user_ids
  end
end
In the controller:
@users = User.where(id: random_user_ids(10))
This is much quicker than the .order("RANDOM()").limit(10) method - I went from a 13-second load time down to 500 ms.
I have a collection of users:
users = User.all()
I want to pass a subset of the user collection to a method.
Each subset should contain 1000 items (or less on the last iteration).
some_method(users)
So say users has 9500 items in it, I want to call some_method 10 times, 9 times passing 1000 items and the last time 500.
You can use the Enumerable#each_slice method:
User.all.each_slice(1000) do |subarray|
  some_method subarray
end
but that would first pull all the records from the database.
However, I guess you could make something like this:
def ar_each_slice(scope, size)
  (scope.count.to_f / size).ceil.times do |i|
    yield scope.scoped(:offset => i * size, :limit => size)
  end
end
and use it as in:
ar_each_slice(User.scoped, 1000) do |slice|
  some_method slice.all
end
It will first get the number of records (using COUNT), then fetch them 1000 at a time using the LIMIT clause, passing each slice to your block.
Since Rails 2.3, one can specify batch_size:
User.find_in_batches(:batch_size => 1000) do |users|
  some_method(users)
end
In this case, the framework will run a SELECT query for every 1000 records, which keeps memory usage low when you are processing a large number of records.
I think you should divide them into subsets manually.
For example,
some_method(users[0..999])
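That handles only the first slice; a sketch of the full manual loop (assuming users is already a plain Array) could be:
# step through the array 1000 records at a time;
# the final slice will simply be shorter
(0...users.size).step(1000) do |start|
  some_method(users[start, 1000])
end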
I forgot about using :batch_size but Chandra suggested it. That's the right way to go.
Using .all asks the database to retrieve all records and passes them to Ruby, which holds them and iterates over them internally. That is a really bad approach if your database will be growing: as the glob of records grows, the DBMS works harder and Ruby has to allocate more and more space to hold them, so your response time grows with it.
A better solution is to use the :limit and :offset options to tell the DBMS to successively find the first 1000 records at offset 0, then the next 1000 at offset 1000, and so on. Keep looping until there are no more records.
You can determine how many times you'll have to loop by doing a .count before you begin, which is really fast unless your where-clause is beastly - or simply loop until you get no records back.
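A sketch of that loop, using the modern relation syntax (swap in your own scope and batch size):
offset = 0
loop do
  batch = User.limit(1000).offset(offset).to_a
  break if batch.empty? # stop once the window runs past the last record
  some_method(batch)
  offset += 1000
end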