This simple code snippet's output is in the range of 11 - 13 milliseconds. Now, assuming for the sake of the question that an increment of x is just a single instruction, the 2.3 GHz CPU of mine should take roughly a second to execute it, since the value of Integer.MAX_VALUE is close to 2 billion. Why is the answer on the order of a few milliseconds (11 - 13 ms) rather than on the order of seconds (900 - 1100 ms)?
long time1 = System.currentTimeMillis();
int x = 0;
while (x < Integer.MAX_VALUE) {
    x++;
}
System.out.println(System.currentTimeMillis() - time1);
In theory, this can be optimized down to x = Integer.MAX_VALUE and then, if x is not used afterwards, removed completely.
It is getting harder and harder to test how long things take, since they get optimized out, especially if the result is unused.
Try setting x to different start values and using x after the timer, in a print or in some other calculation that leads to a print.
I run a MapReduce job on a Hadoop cluster. The job's running times I saw in the browser at master:8088 and master:19888 (the job history server web UI) are shown below:
master:8088
master:19888
I have two questions:
Why are the elapsed times from the two pictures different?
Why sometimes the Average Reduce Time is a negative number?
It looks like the Average Reduce Time is based on the times the previous tasks (shuffle/merge) took to finish and not necessarily the amount of time the reduce actually took to run.
Looking at this source code you can see the relevant calculations occurring around line 300.
if (attempt.getState() == TaskAttemptState.SUCCEEDED) {
    numReduces++;
    avgShuffleTime += (attempt.getShuffleFinishTime() - attempt.getLaunchTime());
    avgMergeTime += attempt.getSortFinishTime() - attempt.getShuffleFinishTime();
    avgReduceTime += (attempt.getFinishTime() - attempt.getSortFinishTime());
}
Followed by:
if (numReduces > 0) {
    avgReduceTime = avgReduceTime / numReduces;
    avgShuffleTime = avgShuffleTime / numReduces;
    avgMergeTime = avgMergeTime / numReduces;
}
Looking at your numbers, they seem to be generally in line with this approach to calculating the run times (everything converted to seconds):
Total Pre-reduce time = Map Run Time + Ave Shuffle + Ave Merge
143 = 43 + 83 + 17
Ave Reduce Time = Elapsed Time - Total Pre-reduce
-10 = 133 - 143
So looking at how long the Map, Shuffle and Merge took compared with the Elapsed time, we end up with a negative number close to your -8.
This is a partial answer, only for question 1!
I see a difference in "Submitted" and "Started" of 8 seconds in the second picture, while the time "Started" in the first picture is equal to the "Submitted" time of the second. I guess this covers the 8-second difference that you see as "Elapsed" time.
I am very curious about the second question as well, but it may not be a coincidence that it is also 8 seconds.
The following two code snippets perform the same task (generating M samples uniformly from an N-dimensional sphere). I was wondering why the latter one consumes much more time than the former.
%% MATLAB R2014a
M = 30;
N = 10000;
#1
tic
S = zeros(M, N);
for k = 1:M
    P = ones(1, N);
    for i = 1:N - 1
        t = rand*2*pi;
        P(1:i) = P(1:i)*sin(t);
        P(i+1) = P(i+1)*cos(t);
    end
    S(k,:) = P;
end
toc
#2
tic
S = ones(M, N);
for k = 1:M
    for i = 1:N - 1
        t = rand*2*pi;
        S(k, 1:i) = S(k, 1:i)*sin(t);
        S(k, i+1) = S(k, i+1)*cos(t);
    end
end
toc
The output is:
Elapsed time is 15.007667 seconds.
Elapsed time is 59.745311 seconds.
I also tried M = 1:
Elapsed time is 0.463370 seconds.
Elapsed time is 1.566913 seconds.
#2 is nearly 4 times slower than #1. Is the frequent 2-D element access in #2 making it so time-consuming?
The time difference is due to memory access patterns, and how well they map onto the cache, and also possibly to MATLAB's exploitation of your hardware vector unit (SSE/AVX). MATLAB stores matrices "column-major", meaning S(2,1) is stored next to S(1,1) in memory.
In #1, you process each sample using the vector P, which lives in contiguous memory. These 80,000 bytes fit easily in L2 cache for the fast repeated access you need to perform. They're also neighbors, and trivially vectorized (I'm not certain if MATLAB performs this optimization, but I'd hope so...)
In #2, you access a row of S at a time, which is not contiguous; consecutive row elements are separated by a stride of M values. So each row is spread across 30*80,000 bytes, which does not fit in L2 cache. It has to be read back in for each repeated access, even though you're ignoring 29/30 of the values in that data.
Here's the test. All I'm doing is transposing S so that you can process a column at a time instead, then putting it back at the end just to get the same result:
#3
tic
S = ones(N, M);
for k = 1:M
    for i = 1:N - 1
        t = rand*2*pi;
        S(1:i, k) = S(1:i, k)*sin(t);
        S(i+1, k) = S(i+1, k)*cos(t);
    end
end
S = S.';
toc
Results:
Elapsed time is 11.254212 seconds.
Elapsed time is 45.847750 seconds.
Elapsed time is 11.501580 seconds.
Yep, transposing S gets us the same contiguous access and performance as the separate-vector approach. By the way, an L3 access costs roughly 4x more clock cycles than an L2 access...
Let's see if we can find any breakpoints related to cache size. Here's N = 1000, where everything should fit in L2:
Elapsed time is 0.240184 seconds.
Elapsed time is 0.373448 seconds.
Elapsed time is 0.258566 seconds.
Much lower difference, though now we're probably into L1 effects.
Finally, here's a completely different way to solve your problem. It relies on the fact that multivariate normal RVs have the correct (spherical) symmetry.
#4
tic
S = randn(M, N);
S = bsxfun(@rdivide, S, sqrt(sum(S.*S, 2)));
toc
Elapsed time is 10.714104 seconds.
Elapsed time is 45.351277 seconds.
Elapsed time is 11.031061 seconds.
Elapsed time is 0.015068 seconds.
I suspect the advantage comes from using a hard-coded 1 in the array access. If you try M = 1 you will still see a significant speed-up for the sin(t) line. My guess is that the assembly under the hood can use immediate-operand instructions, as opposed to reloading the variable k into a register.
I am running a for loop like so:
for var i: Float = 1.000; i > 0; i -= 0.005 {
    println(i)
}
and I have found that after i has decreased past a certain value, instead of decreasing by exactly 0.005 it decreases by ever so slightly less than 0.005, so that when it reaches the 201st iteration, i is not 0 but rather something very close to 0, and so the for loop runs one extra time. The output is as follows:
1.0
0.995
0.99
0.985
...
0.48
0.475001
0.470001
...
0.0100008 // should be 0.01
0.00500081 // should be 0.005
8.12113e-07 // should be 0
My question is, first of all, why is this happening, and second of all, what can I do so that i always decreases by exactly 0.005, so that the loop does not run on the 201st iteration?
Thanks a lot,
bigelerow
The Swift Floating-Point Number documentation states:
Note
Double has a precision of at least 15 decimal digits, whereas the precision of Float can be as little as 6 decimal digits. The appropriate floating-point type to use depends on the nature and range of values you need to work with in your code. In situations where either type would be appropriate, Double is preferred.
In this case, it looks like the error is on the order of 4.060564999999999e-09 in each subtraction, based on the amount left over after 200 subtractions. Indeed, changing Float to Double reduces the error such that the loop runs until i = 0.00499999999999918, when it should be 0.005.
That is all well and good; however, we still have the problem of constructing a loop that will run until i becomes zero. If the amount that you reduce i by remains constant throughout the loop, one only slightly unfortunate workaround is:
var x: Double = 1
let reduction = 0.005
for var i = Int(x/reduction); i >= 0; i -= 1, x = Double(i) * reduction {
    println(x)
}
In this case your error won't compound, since we are using an integer to index how many reductions are needed to reach the current x; the error is thus independent of the length of the loop.
Can anyone help me with some algorithm for this problem?
We have a big number (19 digits) and, in a loop, we subtract one of the digits of that number from the number itself.
We continue to do this until the number reaches zero. We want to calculate the minimum number of subtractions that makes a given number reach zero.
The algorithm must respond fast, for a 19-digit number (10^19), within two seconds. As an example, providing an input of 36 will give 7:
1. 36 - 6 = 30
2. 30 - 3 = 27
3. 27 - 7 = 20
4. 20 - 2 = 18
5. 18 - 8 = 10
6. 10 - 1 = 9
7. 9 - 9 = 0
Thank you.
The minimum number of subtractions to reach zero makes this, I suspect, a very thorny problem, one that will require a great deal of backtracking over potential solutions, making it possibly too expensive for your time limitations.
But the first thing you should do is a sanity check. Since the largest digit is a 9, a 19-digit number will require about 10^18 subtractions to reach zero. Code up a simple program to continuously subtract 9 from 10^19 until it becomes less than ten. If you can't do that within the two seconds, you're in trouble.
By way of example, the following program (a):
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
    unsigned long long x = strtoull(argv[1], NULL, 10);
    x /= 1000000000;
    while (x > 9)
        x -= 9;
    return x;
}
when run with the argument 10000000000000000000 (10^19), takes a second and a half of clock time (and CPU time, since it's all calculation) even at gcc's insane optimisation level of -O3:
real 0m1.531s
user 0m1.528s
sys 0m0.000s
And that's with the one-billion divisor just before the while loop, meaning the full number of iterations would take about 48 years.
So a brute-force method isn't going to help here; what you need is some serious mathematical analysis, which probably means you should post a similar question over at https://math.stackexchange.com/ and let the math geniuses have a shot.
(a) If you're wondering why I'm getting the value from the user rather than using a constant of 10000000000000000000ULL, it's to prevent gcc from calculating it at compile time and turning it into something like:
mov $1, %eax
Ditto for the return x, which prevents it from noticing that I don't use the final value of x and hence optimising the loop out of existence altogether.
I don't have a solution that can solve 19 digit numbers in 2 seconds. Not even close. But I did implement a couple of algorithms (including a dynamic programming algorithm that solves for the optimum), and gained some insight that I believe is interesting.
Greedy Algorithm
As a baseline, I implemented a greedy algorithm that simply picks the largest digit in each step:
uint64_t countGreedy(uint64_t inputVal) {
    uint64_t remVal = inputVal;
    uint64_t nStep = 0;
    while (remVal > 0) {
        uint64_t digitVal = remVal;
        uint_fast8_t maxDigit = 0;
        while (digitVal > 0) {
            uint64_t nextDigitVal = digitVal / 10;
            uint_fast8_t digit = digitVal - nextDigitVal * 10;
            if (digit > maxDigit) {
                maxDigit = digit;
            }
            digitVal = nextDigitVal;
        }
        remVal -= maxDigit;
        ++nStep;
    }
    return nStep;
}
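As a quick sanity check, compiling the function above together with a small driver reproduces the 7 steps from the question's example:

#include <stdio.h>
#include <stdint.h>

uint64_t countGreedy(uint64_t inputVal);   /* the function above */

int main(void) {
    /* 36 -> 30 -> 27 -> 20 -> 18 -> 10 -> 9 -> 0, i.e. 7 steps */
    printf("%llu\n", (unsigned long long)countGreedy(36));
    return 0;
}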
Dynamic Programming Algorithm
The idea for this is that we can calculate the optimum incrementally. For a given value, we pick a digit, which adds one step to the optimum number of steps for the value with the digit subtracted.
With the target function (optimum number of steps) for a given value named optSteps(val), and the digits of the value named d_i, the following relationship holds:
optSteps(val) = 1 + min(optSteps(val - d_i))
This can be implemented with a dynamic programming algorithm. Since d_i is at most 9, we only need the previous 9 values to build on. In my implementation, I keep a circular buffer of 10 values:
static uint64_t countDynamic(uint64_t inputVal) {
    uint64_t minSteps[10] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
    uint_fast8_t digit0 = 0;
    for (uint64_t val = 10; val <= inputVal; ++val) {
        digit0 = val % 10;
        uint64_t digitVal = val;
        uint64_t minPrevStep = 0;
        bool prevStepSet = false;
        while (digitVal > 0) {
            uint64_t nextDigitVal = digitVal / 10;
            uint_fast8_t digit = digitVal - nextDigitVal * 10;
            if (digit > 0) {
                uint64_t prevStep = 0;
                if (digit > digit0) {
                    prevStep = minSteps[10 + digit0 - digit];
                } else {
                    prevStep = minSteps[digit0 - digit];
                }
                if (!prevStepSet || prevStep < minPrevStep) {
                    minPrevStep = prevStep;
                    prevStepSet = true;
                }
            }
            digitVal = nextDigitVal;
        }
        minSteps[digit0] = minPrevStep + 1;
    }
    return minSteps[digit0];
}
Comparison of Results
This may be considered a surprise: I ran both algorithms on all values up to 1,000,000. The results are absolutely identical. This suggests that the greedy algorithm actually calculates the optimum.
I don't have a formal proof that this is indeed true for all possible values. It intuitively kind of makes sense to me. If in any given step, you choose a smaller digit than the maximum, you compromise the immediate progress with the goal of getting into a more favorable situation that allows you to catch up and pass the greedy approach. But in all the scenarios I thought about, the situation after taking a sub-optimal step just does not get significantly more favorable. It might make the next step bigger, but that is at most enough to get even again.
Complexity
While both algorithms look linear in the size of the value, they also loop over all digits in the value. Since the number of digits corresponds to log(n), I believe the complexity is O(n * log(n)).
I think it's possible to make it linear by keeping counts of the frequency of each digit, and modifying them incrementally. But I doubt it would actually be faster. It requires more logic, and turns a loop over all digits in the value (which is in the range of 2-19 for the values we are looking at) into a fixed loop over 10 possible digits.
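For what it's worth, here is a rough sketch of how such digit counts could be maintained incrementally (this is just an illustration of the idea, not something I benchmarked). When a value is incremented by 1, its trailing 9s roll over to 0s and the first non-9 digit from the right goes up by one:

#include <stdint.h>
#include <stdio.h>

static uint8_t digitCount[10];

/* Build the digit counts of val from scratch (val >= 1 assumed). */
static void initDigitCounts(uint64_t val) {
    for (int d = 0; d < 10; ++d) {
        digitCount[d] = 0;
    }
    while (val > 0) {
        digitCount[val % 10]++;
        val /= 10;
    }
}

/* Update the counts from val to val + 1 (val >= 1 assumed):
 * trailing 9s become 0s, then the first non-9 digit goes up by one,
 * or a new leading 1 appears if val was all 9s. */
static void incrementDigitCounts(uint64_t val) {
    while (val % 10 == 9) {
        digitCount[9]--;
        digitCount[0]++;
        val /= 10;
    }
    if (val == 0) {
        digitCount[1]++;
    } else {
        digitCount[val % 10]--;
        digitCount[val % 10 + 1]++;
    }
}

int main(void) {
    initDigitCounts(199);
    incrementDigitCounts(199);   /* counts now describe 200: one 2, two 0s */
    for (int d = 0; d < 10; ++d) {
        printf("digit %d: %d\n", d, digitCount[d]);
    }
    return 0;
}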
Runtimes
Not surprisingly, the greedy algorithm is faster to calculate a single value. For example, for value 1,000,000,000, the runtimes on my MacBook Pro are:
greedy: 3 seconds
dynamic: 36 seconds
On the other hand, the dynamic programming approach is obviously much faster at calculating all the values, since its incremental approach needs them as intermediate results anyway. For calculating all values from 10 to 1,000,000:
greedy: 19 minutes
dynamic: 0.03 seconds
As already shown in the runtimes above, the greedy algorithm gets about as high as 9 digit input values within the targeted runtime of 2 seconds. The implementations aren't really tuned, and it's certainly possible to squeeze out some more time, but it would be fractional improvements.
Ideas
As already explored in another answer, there's no chance of getting the result for 19 digit numbers in 2 seconds by subtracting digits one by one. Since we subtract at most 9 in each step, completing this for a value of 10^19 needs more than 10^18 steps. We mostly use computers that perform in the rough range of 10^9 operations/second, which suggests that it would take about 10^9 seconds.
Therefore, we need something that can take shortcuts. I can think of scenarios where that's possible, but haven't been able to generalize it to a full strategy so far.
For example, if your current value is 9999, you know that you can subtract 9 until you reach 9000. So you can calculate that you will make 112 steps ((9999 - 9000) / 9 + 1) where you subtract 9, which can be done in a few operations.
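Here is that 9999 example as a minimal sketch; the 9000 boundary is specific to this example, since that is where the maximum digit can first change:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t val = 9999;
    uint64_t boundary = 9000;   /* below this the maximum digit may change */

    /* Every step taken in [9000, 9999] subtracts 9, so all of them
     * can be collapsed into one calculation. */
    uint64_t steps = (val - boundary) / 9 + 1;   /* 112 */
    uint64_t newVal = val - steps * 9;           /* 8991 */

    printf("%llu steps of -9: %llu -> %llu\n",
           (unsigned long long)steps,
           (unsigned long long)val,
           (unsigned long long)newVal);
    return 0;
}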
As said in comments already, and agreeing with @paxdiablo's other answer, I'm not sure if there is an algorithm to find the ideal solution without some backtracking; and the size of the number and the time constraint might be tough as well.
A general consideration though: You might want to find a way to decide between always subtracting the highest digit (which will decrease your current number by the largest possible amount, obviously), and by looking at your current digits and subtracting which of those will give you the largest “new” digit.
Say, your current number only consists of digits between 0 and 5 – then you might be tempted to subtract the 5 to decrease your number by the highest possible value, and continue with the next step. If the last digit of your current number is 3 however, then you might want to subtract 4 instead – since that will give you 9 as new digit at the end of the number, instead of “only” 8 you would be getting if you subtracted 5.
Whereas if you have a 2 and two 9s among your digits already, and the last digit is a 1, then you might want to subtract the 9 anyway, since you will be left with the second 9 in the result (at least in most cases; in some edge cases it might get obliterated from the result as well). Subtracting the 2 instead would not have the advantage of giving you a “high” 9 that you would otherwise not have in the next step, and would have the disadvantage of not lowering your number by as large an amount as subtracting the 9 would …
But every digit you subtract will not only affect the next step directly, but the following steps indirectly – so again, I doubt there is a way to always chose the ideal digit for the current step without any backtracking or similar measures.
I am facing an algorithm problem.
We have a task that runs every 10ms and, during the run, an event can happen or not happen. Is there any simple algorithm that allows us to keep track of how many times an event is triggered within the last, say, 1 second?
The only idea that I have is to implement an array and save all the events. As we are programming embedded systems, there is not enough space...
Thanks in advance.
An array of 13 bytes holds a second's worth of events in 10 ms steps.
Consider it an array of 104 bits, one bit per 10 ms step (100 of them cover the full second, with 4 to spare).
If the event occurs, mark the bit and advance to the next one; otherwise just advance to the next bit/byte.
If you want, run-length encode after each second to offload the event bits into another value.
Or treat it as a circular buffer and keep the count available for query.
Or both.
You could reduce the array size to match the space available.
It is not clear if an event could occur multiple times while your task was running, or if it is always 10ms between events.
This is more-or-less what Dtyree and Weeble have suggested, but an example implementation may help (C code for illustration):
#include <stdint.h>
#include <stdbool.h>

#define HISTORY_LENGTH 100 // 1 second when called every 10ms

int rollingcount( bool event )
{
    static uint8_t event_history[(HISTORY_LENGTH+7) / 8] ;
    static int next_history_bit = 0 ;
    static int event_count = 0 ;

    // Get history byte index and bit mask
    int history_index = next_history_bit >> 3 ;             // ">> 3" is same as "/ 8" but often faster
    uint8_t history_mask = 1 << (next_history_bit & 0x7) ;  // "& 0x07" is same as "% 8" but often faster

    // Get current bit value
    bool history_bit = (event_history[history_index] & history_mask) != 0 ;

    // If oldest history event is not the same as new event, adjust count
    if( history_bit != event )
    {
        if( event )
        {
            // Increment count for 0->1
            event_count++ ;

            // Replace oldest bit with 1
            event_history[history_index] |= history_mask ;
        }
        else
        {
            // Decrement count for 1->0
            event_count-- ;

            // Replace oldest bit with 0
            event_history[history_index] &= ~history_mask ;
        }
    }

    // Advance to the next (oldest) history bit
    next_history_bit++ ;
    if( next_history_bit >= HISTORY_LENGTH ) // Could use "next_history_bit %= HISTORY_LENGTH" here, but may be expensive on some processors
    {
        next_history_bit = 0 ;
    }

    return event_count ;
}
For a 100-sample history, it requires 13 bytes plus two integers of statically allocated memory. I have used int for generality, but in this case uint8_t counters would suffice. In addition there are three stack variables, and again the use of int is not necessary if you really need to optimise memory use. So in total it is possible to use as little as 15 bytes of static data plus three bytes of stack. The event argument may or may not be passed on the stack, and then there is the function call return address, but again that depends on the calling convention of your compiler/processor.
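For what it's worth, a usage sketch might look something like this; poll_event() is just a placeholder for however your system detects the event, and the threshold of 50 is arbitrary:

#include <stdbool.h>

int rollingcount( bool event ) ;   // the function defined above
bool poll_event( void ) ;          // placeholder: however your system detects the event

// Hypothetical 10ms task body
void task_10ms( void )
{
    int events_last_second = rollingcount( poll_event() ) ;

    if( events_last_second > 50 )
    {
        // React to a high event rate here
    }
}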
You need some kind of list/queue etc., but a ring buffer probably has the best performance.
You need to store 100 counters (one for each 10 ms time period during the last second) and an index to the current one.
Ring buffer solution:
(I used pseudo code).
Create a counter_array of 100 counters (initially filled with 0's) and an index current_counter:
int[100] counter_array;
current_counter = 0
At the start of each 10 ms cycle, advance to the next slot (wrapping around) and clear it:
current_counter = (current_counter + 1) % 100;
counter_array[current_counter] = 0;
For every event:
counter_array[current_counter]++
To check the number of events during the last second, take the sum of counter_array.
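If it helps, here is one way that pseudo code might look in C, with a running total kept alongside the array so that the query is O(1) instead of summing 100 counters each time (the names are only for illustration):

#include <stdint.h>

#define SLOT_COUNT 100                      /* 100 slots of 10 ms = 1 second */

static uint8_t  counter_array[SLOT_COUNT];  /* events per 10 ms slot */
static uint8_t  current_counter = 0;        /* index of the current slot */
static uint16_t rolling_total = 0;          /* events over the whole second */

/* Call once at the start of every 10 ms cycle. */
void start_cycle(void)
{
    current_counter = (current_counter + 1) % SLOT_COUNT;
    rolling_total -= counter_array[current_counter];  /* drop the slot from 1 s ago */
    counter_array[current_counter] = 0;
}

/* Call whenever the event occurs. */
void record_event(void)
{
    counter_array[current_counter]++;
    rolling_total++;
}

/* Number of events during the last second. */
uint16_t events_last_second(void)
{
    return rolling_total;
}

The uint8_t slot counters are enough here because the event can fire at most once per 10 ms cycle, so each slot never exceeds 1.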
Can you afford an array of 100 booleans? Perhaps as a bit field? As long as you can afford the space cost, you can track the number of events in constant time:
Store:
A counter C, initially 0.
The array of booleans B, of size equal to the number of intervals you want to track, i.e. 100, initially all false.
An index I, initially 0.
Each interval:
Read the boolean at B[I], and decrement C if it's true.
Set the boolean at B[I] to true if the event occurred in this interval, false otherwise.
Increment C if the event occurred in this interval.
Increment I; when I reaches 100, reset it to 0.
That way you at least avoid scanning the whole array every interval.
EDIT - Okay, so you want to track events over the last 3 minutes (180s, 18000 intervals). Using the above algorithm and cramming the booleans into a bit-field, that requires total storage:
2 byte unsigned integer for C
2 byte unsigned integer for I
2250 byte bit-field for B
That's pretty much unavoidable if you require a precise count of the number of events in the last 180.0 seconds at all times. I don't think it would be hard to prove that you need all of that information to be able to give an accurate answer at all times. However, if you could live with knowing only the number of events in the last 180 +/- 2 seconds, you could instead reduce your time resolution. Here's a detailed example, expanding on my comment below.
The above algorithm generalizes:
Store:
A counter C, initially 0.
The array of counters B, of size equal to the number of intervals you want to track, i.e. 100, initially all 0.
An index I, initially 0.
Each interval:
Read B[I], and decrement C by that amount.
Write the number of events that occurred this interval into B[I].
Increment C by the number of events that occurred this interval.
Increment I; when I reaches the length of B, reset it to 0.
If you switch your interval to 2s, then in that time 0-200 events might occur. So each counter in the array could be a one-byte unsigned integer. You would have 90 such intervals over 3 minutes, so your array would need 90 elements = 90 bytes.
If you switch your interval to 150ms, then in that time 0-15 events might occur. If you are pressed for space, you could cram this into a half-byte unsigned integer. You would have 1200 such intervals over 3 minutes, so your array would need 1200 elements = 600 bytes.
Will the following work for your application?
A rolling event counter that increments every event.
In the routine that runs every 10ms, you compare the current event counter value with the event counter value stored the last time the routine ran.
That tells you how many events occurred during the 10ms window.
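A sketch of that idea; event_counter here is assumed to be incremented wherever the event is actually detected (for example in an interrupt handler):

#include <stdint.h>

volatile uint32_t event_counter = 0;   /* incremented wherever the event is detected */

static uint32_t last_counter = 0;

/* Call from the 10ms routine: returns the number of events since the last call. */
uint32_t events_in_window(void)
{
    uint32_t now = event_counter;
    uint32_t delta = now - last_counter;   /* unsigned wraparound is handled naturally */
    last_counter = now;
    return delta;
}

To answer the original one-second question you would still need to accumulate these per-window counts, for example with one of the ring buffers described above.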