I am trying to write a simple script where the inputs would be a start date, an end date and a total amount of hours (e.g., 150), and the script would generate a simple report containing random date-time intervals (ideally on weekdays) that sum to the entered amount of hours.
This is what I am trying to achieve:
Start: 2020-01-01
End: 2020-01-31
Total hours: 150
Report:
Jan 1, 2020, 08:02:20 – Jan 1, 2020, 08:55:00: sub time -> 52:40 (52 minutes 40 seconds)
Jan 1, 2020, 09:00:00 – Jan 1, 2020, 09:38:13: sub time -> 38:13 (38 minutes 13 seconds)
...
Jan 3, 2020, 13:15:00 – Jan 3, 2020, 14:45:13: sub time -> 01:30:13 (1 hour 30 minutes 13 seconds)
...
TOTAL TIME: 150 hours (or in minutes)
How do I generate time intervals where the total amount of minutes/hours would be equal to a given number of hours?
I assume the question is loosely worded in the sense that "random" is not meant in a probability sense; that is, the intent is not to select a set of intervals (that total a given number of hours in length) with a mechanism ensuring all possible sets of such intervals are equally likely to be selected. Rather, I understand that a set of intervals is to be chosen (e.g., for testing purposes) in a way that incorporates elements of randomness.
I have assumed the intervals are to be non-overlapping and the number of intervals is to be specified. I don't understand what "with ideally weekdays" means so I have disregarded that.
The heart of the approach I will propose is the following method.
def rnd_lengths(tot_secs, target_nbr)
  max_secs = 2 * tot_secs/target_nbr - 1
  arr = []
  loop do
    break(arr) if tot_secs.zero?
    l = [(0.5 + max_secs * rand).round, tot_secs].min
    arr << l
    tot_secs -= l
  end
end
The method generates an array of integers (lengths of intervals), measured in seconds, ideally having target_nbr elements. tot_secs is the required combined length of the "random" intervals (e.g., 150*3600).
Each element of the array is drawn randomly from a uniform distribution that ranges from zero to max_secs (computed below). This is done sequentially until tot_secs is reached. Should the last random value cause the total to exceed tot_secs, it is reduced to make the total equal tot_secs.
Suppose tot_secs equals 100 and we wish to generate 4 random intervals (target_nbr = 4). That means the average length of the intervals would be 25. As we are using a uniform distribution having an average of (1 + max_secs)/2, we may derive the value of max_secs from the expression
target_nbr * (1 + max_secs)/2 = tot_secs
which is
max_secs = 2 * tot_secs/target_nbr - 1
the first line of the method. For the example I mentioned, this would be
max_secs = 2 * 100/4 - 1
#=> 49
Let's try it.
rnd_lengths(100, 4)
#=> [49, 36, 15]
As you see the array that is returned sums to 100, as required, but it contains only 3 elements. That's why I named the argument target_nbr, as there is no assurance the array returned will have that number of elements. What to do? Try again!
rnd_lengths(100, 4)
#=> [14, 17, 26, 37, 6]
Still not 4 elements, so keep trying:
rnd_lengths(100, 4)
#=> [11, 37, 39, 13]
Success! It may take a few tries to get the correct number of elements, but given the parameters likely to be used and the nature of the probability distribution employed, I wouldn't expect that to be a problem.
Let's put this in a method.
def rdm_intervals(tot_secs, nbr_intervals)
  loop do
    arr = rnd_lengths(tot_secs, nbr_intervals)
    break(arr) if arr.size == nbr_intervals
  end
end
intervals = rdm_intervals(100, 4)
#=> [29, 26, 7, 38]
We can compute random gaps between intervals in the same way. Suppose the intervals fall within a range of 175 seconds (the number of seconds between the start time and end time). Then:
gaps = rdm_intervals(175-100, 5)
#=> [26, 5, 19, 4, 21]
As seen, the gaps sum to 75, as required. We can disregard the last element.
We can now form the intervals. The first interval begins at 26 seconds and ends at 26+29 #=> 55 seconds. The second interval begins at 55+5 #=> 60 seconds and ends at 60+26 #=> 86 seconds, and so on. We therefore find the intervals (each a range of seconds measured from zero) to be:
[26..55, 60..86, 105..112, 116..154]
Note that 175 - 154 = 21, the last element of gaps.
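Here is a small sketch of that bookkeeping, using the intervals and gaps arrays above (non-destructive, via indexing):

intervals = [29, 26, 7, 38]
gaps      = [26, 5, 19, 4, 21]

t = 0
ranges = intervals.each_with_index.map do |len, i|
  start = t + gaps[i]   # skip the gap before this interval
  t = start + len       # the interval ends len seconds later
  start..t
end
ranges  #=> [26..55, 60..86, 105..112, 116..154]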
If one is uncomfortable with the fact that the last elements of intervals and gaps are generally constrained in size, one could of course randomly reposition those elements within their respective arrays, as sketched below.
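For example (a sketch; rand(size + 1) allows every position, including the end):

[intervals, gaps].each do |arr|
  last = arr.pop                        # remove the size-constrained element
  arr.insert(rand(arr.size + 1), last)  # re-insert it at a random position
end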
One might not care if the number of intervals is exactly target_nbr. It would be simpler and faster to just use the first array of interval lengths produced. That's fine, but we still need the above methods to compute the random gaps, as their number must equal the number of intervals plus one:
gaps = rdm_intervals(175-100, intervals.size + 1)
We can now use these two methods to construct a method that will return the desired result. The argument tot_secs of this method equals the total number of seconds spanned by the array intervals returned (e.g., 3600 * 150). The method returns an array containing nbr_intervals non-overlapping ranges of Time objects that fall between the given start and end dates.
require 'date'

def construct_intervals(start_date_str, end_date_str, tot_secs, nbr_intervals)
  start_time = Date.strptime(start_date_str, '%Y-%m-%d').to_time
  secs_in_period = Date.strptime(end_date_str, '%Y-%m-%d').to_time - start_time
  intervals = rdm_intervals(tot_secs, nbr_intervals)
  gaps = rdm_intervals(secs_in_period - tot_secs, nbr_intervals + 1)
  nbr_intervals.times.with_object([]) do |_, arr|
    start_time += gaps.shift
    end_time = start_time + intervals.shift
    arr << (start_time..end_time)
    start_time = end_time
  end
end
See Date::strptime.
Let's try an example.
start_date_str = '2020-01-01'
end_date_str = '2020-01-31'
tot_secs = 3600*150
#=> 540000
construct_intervals(start_date_str, end_date_str, tot_secs, 4)
#=> [2020-01-06 18:05:04 -0800..2020-01-09 03:48:00 -0800,
# 2020-01-09 06:44:16 -0800..2020-01-11 23:33:44 -0800,
# 2020-01-20 20:30:21 -0800..2020-01-21 17:27:44 -0800,
# 2020-01-27 19:08:38 -0800..2020-01-28 01:38:51 -0800]
construct_intervals(start_date_str, end_date_str, tot_secs, 8)
#=> [2020-01-03 18:43:36 -0800..2020-01-04 10:49:14 -0800,
# 2020-01-08 07:55:44 -0800..2020-01-08 08:17:18 -0800,
# 2020-01-11 00:54:36 -0800..2020-01-11 23:00:53 -0800,
# 2020-01-14 05:20:14 -0800..2020-01-14 22:48:45 -0800,
# 2020-01-16 18:28:28 -0800..2020-01-17 22:50:24 -0800,
# 2020-01-22 02:59:31 -0800..2020-01-22 22:33:08 -0800,
# 2020-01-23 00:36:59 -0800..2020-01-24 12:15:37 -0800,
# 2020-01-29 11:22:21 -0800..2020-01-29 21:46:10 -0800]
START -xxx----xxx--x----xxxxx---xx--xx---xx-xx-x-xxx-- END
We need to fill a timespan with alternating periods of ON and OFF. This can be
denoted by a list of timestamps. Let's say that the period always starts with
an OFF period for simplicity's sake.
From the start/end of the timespan and the total seconds in ON state, we
gather useful facts:
the timespan's total size in seconds total_seconds
the second totals of both the ON (on_total_seconds) and the OFF (off_total_seconds) periods
Once we know these, a workable algorithm looks more or less like this (the small date helpers are given minimal sketch implementations):
require 'date'

# these could be parameters as well
MIN_PERIODS = 10
MAX_PERIODS = 100

# minimal sketch implementations of the two date helpers
def get_total_seconds(start_date, end_date)
  end_date.to_time - start_date.to_time
end

def add_seconds(datetime, seconds)
  datetime + seconds
end

def fill_periods(start_date, end_date, on_total_seconds = 150*60*60)
  total_seconds = get_total_seconds(start_date, end_date)
  off_total_seconds = total_seconds - on_total_seconds
  # establish two buckets to pull from alternately in populating our array of durations
  on_bucket  = on_total_seconds
  off_bucket = off_total_seconds
  result = []
  # populate `result` with durations in seconds; `result` will sum to `total_seconds`.
  # each slice is clamped to its bucket so we never overshoot the total.
  while on_bucket > 0 || off_bucket > 0 do
    off_slice = [rand((off_total_seconds / MAX_PERIODS / 2)..(off_total_seconds / MIN_PERIODS / 2)).to_i, off_bucket].min
    off_bucket -= off_slice
    on_slice = [rand((on_total_seconds / MAX_PERIODS / 2)..(on_total_seconds / MIN_PERIODS / 2)).to_i, on_bucket].min
    on_bucket -= on_slice
    # randomness being random, we're going to hit 0 in one bucket before the
    # other. when this happens, just add this (off, on) pair to the last one.
    if off_slice == 0 || on_slice == 0
      last_off, last_on = result.pop(2)
      result << last_off.to_i + off_slice << last_on.to_i + on_slice
    else
      result << off_slice << on_slice
    end
  end
  # build up an array of datetimes by progressively adding seconds to the last timestamp
  datetimes = result.each_with_object([start_date.to_time]) do |period, memo|
    memo << add_seconds(memo.last, period)
  end
  # we want a list of datetime pairs denoting ON periods. since we know our
  # timespan starts with OFF, we start our list of pairs with the second element.
  datetimes.slice(1..-1).each_slice(2).to_a
end
This video covers an implementation of the minimum-coins-to-make-change problem.
https://en.wikipedia.org/wiki/Change-making_problem
The place I'm not clear on is where the interviewer goes into the details of optimization, starting from here.
https://youtu.be/HWW-jA6YjHk?t=1875
He suggests that to make the min number of coins, using denominations [25, 10, 1], we only need to use the algorithm to make change for numbers above 50 cents, after which we can safely just use 25 cents. So if the number was $100.10, we can use 25 cents till we hit 50 cents at which time we need to use the algorithm to compute the precise value.
This makes sense for the given list of denominations [25, 10, 1]. To get the breakpoint figure he suggests using the LCM of the denominations, which is 50 in this case.
For example:
32 = 25*1 + 1*7: 8 coins. But with 10 cents we can do
32 = 10*3 + 1*2: 5 coins.
So we cannot just assume 25 cents is going to be included in the minimum number of coins calculation.
Here is my question --
Suppose we have denominations [25, 10, 5, 1]; the LCM is still 50. But there is no minimal solution for any number over 25 cents that doesn't include the 25.
e.g.
32 = 25*1 + 5*1 + 1*2: 4 coins.
32 = 10*3 + 1*2: 5 coins.
So shouldn't the breakpoint be 25 cents in this case? Instead of the lcm?
Thanks for answering.
The LCM of the values provides a minimum upper bound on the "break point", that point at which we cannot blithely assume that the highest-denomination coin is part of the solution. A little number theory will prove that the LCM is a boundary.
50 is the LCM of {25, 10}. For any amount >= 50, any combination including at least 5*10 can replace those five coins with 2*25, reducing the coin count. This argument applies to all other coins and combinations thereof. This simple demonstration does not universally apply below the LCM; there will be amounts that serve as counterexamples.
To keep the overall algorithm easy to understand and maintain, we use only the two phases: largest coin above that breakpoint, and full DP solution below -- where, for most applications, even a brute-force solution is generally efficient enough for practical purposes.
They didn't say we can't use 25 when the input is lower than the break point. They suggested that a good optimisation can be to use the highest denomination until we reduce the number to the break point (because that is guaranteed to be the least number of coins needed for that portion) and then switch to the more resource-intensive algorithm to count the rest of the needed coins.
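To make that concrete, here is a rough Ruby sketch of the two-phase idea (my own construction, not code from the video): greedily spend the largest coin while the amount stays at or above the LCM breakpoint, then solve the small remainder exactly with DP.

def min_coins(amount, denoms)
  breakpoint = denoms.reduce(1) { |acc, d| acc.lcm(d) }
  largest = denoms.max
  greedy = 0
  # phase 1: at or above the breakpoint, the largest coin is always safe
  while amount >= breakpoint
    amount -= largest
    greedy += 1
  end
  # phase 2: exact DP on the small remainder
  dp = Array.new(amount + 1, Float::INFINITY)
  dp[0] = 0
  (1..amount).each do |amt|
    denoms.each { |d| dp[amt] = [dp[amt], dp[amt - d] + 1].min if d <= amt }
  end
  greedy + dp[amount]
end

min_coins(32, [25, 10, 1])      #=> 5   (10*3 + 1*2; no 25 used)
min_coins(10_010, [25, 10, 1])  #=> 401 (25*400 + 10*1)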
On a spinning disk, I have N records that I want to permute. In RAM, I have an array of N indices that contain the desired permutation. I also have enough RAM to hold n records at a time. What algorithm can I use to execute the permutation on disk as quickly as possible, taking into account the fact that sequential disk access is a lot faster?
I have plenty of excess disk to use for intermediate files, if desired.
This is a known problem. Find the cycles in your permutation order. For instance, given five records to permute [1, 0, 3, 4, 2], you have cycles (0, 1) and (2, 3, 4). You do this by picking an unused starting position; follow the index pointers until you return to your starting point. The sequence of pointers describes a cycle.
You then permute the records with an internal temporary variable, one record long.
temp = disk[0]
disk[0] = disk[1]
disk[1] = temp
temp = disk[2]
disk[2] = disk[3]
disk[3] = disk[4]
disk[4] = temp
Note that you can also perform the permutation as you traverse the pointers. You will also need some method to recall which positions have already been permuted, such as clearing the permutation index (set it to -1).
Can you see how to generalize that?
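Here is one possible generalization, sketched in Ruby as an in-memory model (on disk, the disk[] accesses would become record reads and writes):

def permute!(disk, perm)
  perm = perm.dup              # keep the caller's index array intact
  perm.each_index do |start|
    next if perm[start] == -1  # already handled in an earlier cycle
    temp = disk[start]         # one record of working storage
    i = start
    while perm[i] != start     # follow the cycle until it closes
      nxt = perm[i]
      disk[i] = disk[nxt]
      perm[i] = -1             # mark this position as done
      i = nxt
    end
    disk[i] = temp             # close the cycle
    perm[i] = -1
  end
  disk
end

permute!([:a, :b, :c, :d, :e], [1, 0, 3, 4, 2])
#=> [:b, :a, :d, :e, :c]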
This is a problem of interval coordination. I'll simplify the notation slightly by changing the memory available to M records -- having upper- and lower-case N is a little confusing.
First, we re-cast the permutations as a series of intervals: the rotational span during which a record needs to reside in RAM. If a record needs to be written to a lower-numbered position, we increase the endpoint by the list size to indicate the wraparound -- we have to wait for the next disk rotation. For instance, using my earlier example, we expand the list:
[1, 0, 3, 4, 2]
0 -> 1
1 -> 0+5
2 -> 3
3 -> 4
4 -> 2+5
Now, we apply standard greedy scheduling resolution. First, sort by endpoint:
[0, 1]
[2, 3]
[3, 4]
[1, 5]
[4, 7]
Now, apply the algorithm for M-1 "lanes"; the extra one is needed for swap space. We fill each lane, appending the interval with the earliest endpoint, whose start-point doesn't overlap:
[0, 1] [2, 3] [3, 4] [4, 7]
[1, 5]
We can do this in a total of 7 "ticks" if M >= 3. If M=2, we defer the second lane by 2 rotations to [11, 15].
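A rough sketch of that greedy step (my reading of it, not the author's code): sort by endpoint, then drop each interval into the first lane whose last interval it doesn't overlap.

def assign_lanes(intervals, lane_count)
  lanes = Array.new(lane_count) { [] }
  intervals.sort_by(&:last).each do |iv|
    lane = lanes.find { |l| l.empty? || l.last.last <= iv.first }
    lane << iv if lane   # anything that fits no lane is deferred a rotation
  end
  lanes
end

assign_lanes([[0, 1], [2, 3], [3, 4], [1, 5], [4, 7]], 2)
#=> [[[0, 1], [2, 3], [3, 4], [4, 7]], [[1, 5]]]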
Sneftal's nice example gives us more trouble, with deeper overlap:
[0, 4]
[1, 5]
[2, 6]
[3, 7]
[4, 0+8]
[5, 1+8]
[6, 2+8]
[7, 3+8]
This requires 4 "lanes" if available, deferring lanes as needed if M < 5.
The pathological case is where every record in the permutation needs to be copied back one position, such as [3, 0, 1, 2], with M=2.
[0, 3]
[1, 4]
[2, 5]
[3, 6]
In this case, we walk through the deferral cycle multiple times. At the end of every rotation, we have to defer all remaining intervals by one rotation, resulting in
[0, 3] [3, 6] [2+4, 5+4] [1+4+4, 4+4+4]
Does that get you moving, or do you need more detail?
I have an idea, which might need further improvement. But here goes:
Suppose the hdd has the following structure:
5 4 1 2 3
And we want to write out this permutation:
2 3 5 1 4
Since the hdd is a circular buffer, and assuming it can only rotate in one direction, we can write the above permutation using shifts, as such:
5 >> 2
4 >> 3
1 >> 1
2 >> 2
3 >> 2
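(One way to compute that shift list, as a sketch; it assumes the values are distinct:)

def shifts(current, desired)
  n = current.size
  current.map { |x| (desired.index(x) - current.index(x)) % n }
end

shifts([5, 4, 1, 2, 3], [2, 3, 5, 1, 4])  #=> [2, 3, 1, 2, 2]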
So let's put that in an array, and since we know it is a circular array, let's put its mirrors side by side:
| 2 3 1 2 2 | 2 3 1 2 2 | 2 3 1 2 2 | 2 3 1 2 2 | ... Inf
Since we want to favor sequential reads (or writes), we can put a cost function on the above series. Let the cost function be linear, i.e.:
0 1 2 3 4 5 6 7 8 9 10 ... Inf
Now, let us add the cost function to the above series, but how to select the starting point?
The idea is to select the starting point such that you get the longest contiguous monotonically increasing sequence.
For example, if you select the 0 point to be on "3", you'll get
(1) | - 3 2 4 5 | 6 8 7 9 10 | ...
If you select the 0 point to be on "2", the one just right of "1", you'll get:
(2) | - - - 2 3 | 4 6 5 7 8 | ...
Since we are trying to favor consecutive reads, let's define our read-write function f() to work as such:
f():
At the currently pointed hdd location, the function reads the currently pointed record into available RAM (namely, total space - 1, because we want to save 1 slot for swap).
If no space is left in RAM for the read, the function asserts and the program halts.
At the current hdd location, if RAM holds the value that we want written at that location, the function reads the current record into the swap space, writes the wanted value from RAM to the hdd, and destroys the value in RAM.
Whenever a value is placed onto the hdd, the function checks whether the sequence is complete. If it is, the program returns with success.
Now, we should note that if the following holds:
shift amount <= n - 1 (n: the number of records our available memory can hold)
we can traverse the hard disk in one pass using the above function. For example:
current: 4 5 6 7 0 1 2 3
we want: 0 1 2 3 4 5 6 7
n : 5
We can start anywhere we want, say from the initial "4". We read 4 items sequentially (RAM now holds 4 items), and we start placing 0 1 2 3 (we can, because n = 5 in total: 4 slots are used and 1 is reserved for swap). So the total is 4 consecutive reads, followed by 8 r-w operations.
Using that analogy, it becomes clear that if we subtract "n-1" from equations (1) and (2), the positions with value "<= 0" are better suited as the initial position, because the ones higher than zero will definitely require another pass.
So we select eq. (2) and subtract. For, let's say, "n = 3", we subtract 2 from eq. (2):
(2) | - - - 0 1 | 2 4 3 5 6 | ...
Now it is clear that, using f(), and starting from 0, assuming n = 3, we will have a starting operation as such: r, r, r-w, r-w, ...
So, how do we do the rest and find the minimum cost? We place an array with initial minimum cost just below equation (2). The positions in that array signify where we want f() to be executed.
| - - - 0 1 | 2 4 3 5 6 | ...
| - - - 1 1 | 1 1 1 1 1 | ...
The second array, the one with 1's and 0's, tells the program where to execute f(). Note that if we assumed those locations wrongly, f() will assert.
Before we actually start placing files on the hdd, we of course want to check that the f() positions are correct. We check whether there are assertions, and we try to minimize cost whilst removing all assertions. So, e.g.:
(1) 1111000000000000001111
(2) 1111111000000000000000
(1) obviously has a higher cost than (2). So the question simplifies to finding the 1-0 array.
Some ideas on finding the best array:
The simplest solution is to write out all 1's and turn assertions into 0's (essentially a skip). This method is guaranteed to work.
Brute force: write out an array as shown in (2) and start shifting 1's to the right, in an order that tries out every available permutation:
1111111100000000
1111111010000000
1111110110000000
...
Full random approach: plug in MT19937 and start permuting. Whenever you see a sharp drop in cost, stop executing and implement the hdd copy-paste. You won't find the global minimum, but you'll get a nice trade-off.
Genetic algorithms: For permutations where "shift count is much lower than n - 1", the methodology provided in this answer should (?) provide a global minimum and smooth gradients. This allows one to use genetic algorithms without relying on mutations too much.
One advantage I find in this approach is that, since the OP mentioned this is a real-life problem, the method provides an easier way to change cost functions. It is easier to detect the effect of, say, having lots of contiguous small files to be copied vs. having a single huge file. Or perhaps rrwwrrww is better than rrrrwwww?
Does any of this even make sense? We will have to try out ...
There is an SQL function with date as argument
f(p_date) = mod(to_char(p_date,'mm')+1,2)*39 + to_char(p_date,'dd')
The values of f(p_date) repeat themselves with a period of 2 months, i.e.
f(Feb 7th) = 46
f(Feb 8th) = 47
...
f(Apr 7th) = 46
...
f(Jun 7th) = 46
...
I don't catch a pattern here. Why is the multiplier equal to 39? Where do the 2 months come from?
What I eventually need is the same sort of function, but with a period of 40 days (or 1.5 months):
f(Feb 7th) = 46
..
f(Mar 19th) = 46
..
f(Apr 28th) = 46, etc
Thanks for any help.
Why is the multiplier equal to 39?
The modulo expression evaluates to 0 for odd months and 1 for even months. Multiplied by 39, this is either 0 or 39. Adding the day, the function returns the day for odd months and 39+day for even months.
Thus,
odd (january)
1, 2, 3, ..., last-of-month
even (february)
40, 41, 42, ... 39+last-of-month
Where do the 2 months come from?
The 2 is the divisor of the modulus function. For the inputs 1, 2, 3, 4, 5, ... the modulus function mod(n, 2) returns 1, 0, 1, 0, 1, ... (mathematically, the remainder). Here the input is month+1, so the result is 0 for odd months and 1 for even months; it is used to create the two-month periodicity.
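A quick Ruby illustration of the same formula:

require 'date'

def f(date)
  (date.month + 1) % 2 * 39 + date.day
end

f(Date.new(2020, 2, 7))  #=> 46  (even month: 39 + 7)
f(Date.new(2020, 4, 7))  #=> 46
f(Date.new(2020, 1, 7))  #=> 7   (odd month: just the day)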
@AlexeyKryuchkov, can you give more background about what you're trying to achieve and why?
1.5 months does not map to 40 days (or to any fixed number of days).
If you're trying to define a "40-day month", the easiest solution is to convert a date into an absolute day, then mod by 40.
I wrote a Q&A recently about the complexity of working with calendars: https://stackoverflow.com/a/48611348/9129668.
And adapting some of the code in that answer (which is based on SQL Server, not Oracle), the function you may be looking for would be something like:
((((DATEDIFF(DD, CONVERT(DATETIME2(0),'0001-01-01',102), p_date) + 1) - 1) % 40) + 1) AS day_of_40_day_mth
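For illustration, the same idea sketched in Ruby (Date#jd is an absolute Julian day number, so a repeating 40-day cycle is just jd mod 40):

require 'date'

def day_of_40_day_month(date)
  date.jd % 40 + 1   # 1-based day within the 40-day cycle
end

day_of_40_day_month(Date.new(2018, 2, 7))   #=> 38
day_of_40_day_month(Date.new(2018, 3, 19))  #=> 38  (exactly 40 days later)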
But if you give me a bit more explanation, I might be able to be more specific.
I have an intellectual curiosity that I would love your thoughts on. Don't necessarily need a whole solution; just want to get more eyes on it.
Given:
An array of integers (with the possibility of duplicates)
A range of "acceptable" integers to choose from.
Problem:
Weight the integers in r based on how "congested" they are in a. Two factors go into how "congested" an integer is:
How many times does it appear in a? The more frequent, the more congested.
How many immediate neighbors does it have? The closer the neighbors, the more congested.
#1 weights much more heavily than #2 (how much? not sure; I just think it ought to be "a lot").
Example:
a = [1, 1, 2, 4, 6, 8, 8, 8, 8, 9, 10, 10]
r = (1..11)
Solution Idea:
Here's a quick (and dirty, definitely) solution that I came up with; seems to do the job:
$a = [1, 1, 2, 4, 6, 8, 8, 8, 8, 9, 10, 10]
$r = (1..11)

def how_congested?(integer)
  (10 * $a.count(integer) + 2.5 * number_of_neighbors(integer)) / 100
end

def number_of_neighbors(integer)
  count = 0
  hash = Hash[$a.uniq.map.with_index.to_a]
  count += 1 unless hash[integer + 1].nil?
  count += 1 unless hash[integer - 1].nil?
  count
end

$r.each do |i|
  puts "Congestion of ##{ i }: #{ how_congested?(i) }"
end
# Congestion of #1: 0.225
# Congestion of #2: 0.125
# Congestion of #3: 0.05
# Congestion of #4: 0.1
# Congestion of #5: 0.05
# Congestion of #6: 0.1
# Congestion of #7: 0.05
# Congestion of #8: 0.425
# Congestion of #9: 0.15
# Congestion of #10: 0.225
# Congestion of #11: 0.025
Issue:
This takes into account immediate neighbors, but not neighbors 2 spots away, 3 spots away, etc. I think there should be some sort of sliding scale (e.g., "next-door" neighbors count 2x as much as neighbors 2 spots away, etc.).
I came up with this "algorithm" on a napkin, but I'm wondering if there's a more intelligent way to do it?
Appreciate your thoughts!
Check this out:
class Congestion
  attr_accessor :array, :range

  def initialize(array, range)
    @array = array
    @range = range
  end

  def how_congested?(integer)
    (10 * self.array.count(integer) + 2.5 * weight_of_neighbors(integer)) / 100
  end

  def weight_of_neighbors(integer)
    weight = 0
    @array.uniq.each do |elem|
      weight += case (elem - integer).abs
                when 1 then 3
                when 2 then 2
                when 3 then 1.5
                when 4 then 1.25
                when 5 then 1
                else 0
                end
    end
    weight
  end

  def calculate
    self.range.each do |i|
      congestion = how_congested?(i)
      puts "Congestion of #{i}: #{congestion}"
    end
  end
end
a = [1, 1, 2, 4, 6, 8, 8, 8, 8, 9, 10, 10]
r = (1..11)
c = Congestion.new(a, r)
c.calculate
Which ends up looking like this:
# Congestion of 1: 0.3375
# Congestion of 2: 0.25625
# Congestion of 3: 0.2625
# Congestion of 4: 0.29375
# Congestion of 5: 0.3125
# Congestion of 6: 0.325
# Congestion of 7: 0.3
# Congestion of 8: 0.60625
# Congestion of 9: 0.3125
# Congestion of 10: 0.35625
# Congestion of 11: 0.1875
Basically, the relevant change here is that it takes the absolute difference between the integer we're interested in and each element of the array, and weights that distance on a sliding scale, so closer neighbors count more.