I couldn't find any information on actual performance differences between the loop constructs in Haxe. I've read that Vector has some speed optimizations because it is fixed-length. What is the best way to loop over objects? And does it depend on the iterable type (e.g. Array vs. Vector vs. Map)?
Why does Haxe have so little presence on SO? Every other language has this question answered 5 times over...
Since I couldn't find any existing performance benchmarks, I decided to run a test so this info is available to future Haxe programmers.
First note: if you aren't running the loop very often, it is so fast that it has almost no effect on performance, so if it is easier to just use an Array, do it. Performance only matters if you run over the structure over and over and/or if it is really big.
It turns out that your best choice mostly depends on your data structure. I found that Arrays tend to be faster with the for-each style loop (for (item in array)) than with the standard for loop or a while loop. At small sizes Arrays are essentially as fast as Vectors, so in most cases you don't need to worry about which one to use. However, if you are working with really large collections, it is very beneficial to switch to a Vector. With a Vector, the standard for loop and a while loop are essentially equivalent (although while is a touch faster). Maps are also pretty fast, especially if you avoid the for-each loop.
To reach these conclusions, I first tested the loops under these conditions:
Tested Array, Vector and Map (Map just for fun).
Filled each one so that structure[i] = i for i in 0...size, with sizes in [20, 100, 1000, 10000, 100000], so you can find the size closest to your use case.
Tested each data structure at each size using three loop types:
for (i in 0...size)
for (item in array)
while (i < size)
and inside each loop body I performed a lookup and an assignment: arr[i] = arr[i] + 1;
Each loop type was wrapped in an outer loop, for (iter in 0...1000), to get a more accurate read on how the loops perform. Note that I simply add up the times over those 1000 iterations rather than averaging, so if an Array shows 12 seconds, a single pass really took 12 / 1000 = 0.012 seconds on average.
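To make the setup concrete, here is a minimal sketch of this kind of harness (illustrative code, not the exact benchmark I ran; it uses haxe.Timer.stamp() for timing and shows only the Array / standard-for combination):

class LoopBench {
    static function main() {
        var size = 1000;
        var iterations = 1000;
        var arr = [for (i in 0...size) i];   // structure[i] = i

        var start = haxe.Timer.stamp();
        for (iter in 0...iterations) {
            // "standard for" variant; the other variants use
            // for (item in arr) or while (i < size) instead
            for (i in 0...size) {
                arr[i] = arr[i] + 1;
            }
        }
        trace("Array / for (i in 0...size): " + (haxe.Timer.stamp() - start));
    }
}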
Finally, here is my benchmark (run in Debug for neko in HaxeDevelop):
Running test on size 20:
for (i...20) x 1000
Array : 0.0019989013671875
Vector : 0
Map : 0.00300025939941406
for each(i in iterable) x 1000
Array : 0.00100135803222656
Vector : 0.00099945068359375
Map : 0.0209999084472656
while (i < 20) x 1000
Array : 0.00200080871582031
Vector : 0.00099945068359375
Map : 0.0019989013671875
Running test on size 100:
for (i...100) x 1000
Array : 0.0120010375976563
Vector : 0.0019989013671875
Map : 0.0120010375976563
for each(i in iterable) x 1000
Array : 0.00600051879882813
Vector : 0.00299835205078125
Map : 0.0190010070800781
while (i < 100) x 1000
Array : 0.0119991302490234
Vector : 0.00200080871582031
Map : 0.0119991302490234
Running test on size 1000:
for (i...1000) x 1000
Array : 0.11400032043457
Vector : 0.0179996490478516
Map : 0.104999542236328
for each(i in iterable) x 1000
Array : 0.0550003051757813
Vector : 0.0229988098144531
Map : 0.210000991821289
while (i < 1000) x 1000
Array : 0.105998992919922
Vector : 0.0170001983642578
Map : 0.101999282836914
Running test on size 10000:
for (i...10000) x 1000
Array : 1.09500122070313
Vector : 0.180000305175781
Map : 1.09700012207031
for each(i in iterable) x 1000
Array : 0.553998947143555
Vector : 0.222999572753906
Map : 2.17600059509277
while (i < 10000) x 1000
Array : 1.07900047302246
Vector : 0.170999526977539
Map : 1.0620002746582
Running test on size 100000:
for (i...100000) x 1000
Array : 10.9670009613037
Vector : 1.80499839782715
Map : 11.0330009460449
for each(i in iterable) x 1000
Array : 5.54100036621094
Vector : 2.21299934387207
Map : 20.4000015258789
while (i < 100000) x 1000
Array : 10.7889995574951
Vector : 1.71500015258789
Map : 10.8209991455078
total time: 83.8239994049072
Hope that helps anyone who is worried about performance in Haxe and who needs to use a lot of loops.
I'm trying to speed up steps 1-4 in the following code (the rest is setup that will be predetermined for my actual problem).
% Given sizes:
m = 200;
n = 1e8;
% Given vectors:
value_vector = rand(m, 1);
index_vector = randi([0 200], n, 1);
% Objective: determine the entries of values based on the indices provided by
% index_vector, which correspond to positions in value_vector
% 0. Preallocate
values = zeros(n, 1);
% 1. Remove "0" indices since these won't have values assigned
nonzero_inds = (index_vector ~= 0);
% 2. Examine only nonzero indices
value_inds = index_vector(nonzero_inds);
% 3. Get the values for these indices
nonzero_values = value_vector(value_inds);
% 4. Assign values to output (0 for those with 0 index)
values(nonzero_inds) = nonzero_values;
Here's my analysis of these portions of the code:
1. Necessary since index_vector will contain zeros, which need to be ferreted out. O(n), since it's just a matter of going through the vector one element at a time and checking whether each value is nonzero.
2. Should be O(n) to go through index_vector and retain the entries that are nonzero according to the previous step.
3. Should be O(n), since we have to visit each nonzero index_vector element, and for each element the access into value_vector is O(1).
4. Should be O(n) to go through each element of nonzero_inds, access the corresponding index of values, access the corresponding nonzero_values element, and assign it to the values vector.
The code above takes about 5 seconds to run through steps 1-4 on a 4-core, 3.8 GHz machine. Do you all have any ideas on how this could be sped up? Thanks.
Wow, I found something really interesting. I saw a link in the "Related" section about indexing vectors sometimes being inefficient in MATLAB, so I decided to try a for loop. This code ended up being an order of magnitude faster!
for i = 1:n
if index_vector(i) > 0
values(i) = value_vector(index_vector(i));
end
end
EDIT: Another interesting thing, though unfortunately detrimental to my problem: the speed of this solution depends on the proportion of zeros in index_vector. With index_vector = randi([0 200], n, 1), only a small proportion of the values are zero, but with index_vector = randi([0 1], n, 1) approximately half of the values are zero, and then the above for loop is actually an order of magnitude slower. However, using ~= 0 instead of > 0 in the condition speeds the loop back up so that it's on a similar order of magnitude. Very interesting and odd behavior.
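For concreteness, the ~= variant is the same loop as above with only the comparison changed:

for i = 1:n
    if index_vector(i) ~= 0
        values(i) = value_vector(index_vector(i));
    end
end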
If you stick to MATLAB and the flow of the algorithm you want, rather than doing this in Fortran or C, here's a small start:
Change the randi to rand, round by casting to uint8, and use the > logical operation, which for some reason is faster on my end.
To sum up:
value_vector = rand(m, 1 );
index_vector = uint8(-0.5+201*rand(n,1) );
values = zeros(n, 1);
values(index_vector>0) = value_vector(index_vector(index_vector>0));
This improved the runtime on my end by a factor of 1.6.
I am writing some data to a bitmap file, and I have this function that runs 480,000 times, once for each pixel of an 800 * 600 image, with different arguments (coordinates) and a different return value at each call, which is then stored in an array of size 480,000. This array is then used for further calculation of the colours.
All these calls combined take a lot of time, around a minute at runtime in Visual Studio (with different values at each execution). How can I greatly reduce that time? It's really stressing me out.
Is it the fault of my machine (i5 9th gen, 8GB RAM)? Visual Studio 2019? Or the algorithm entirely? If it's the algorithm, what can I do to reduce its time?
Here's the loop that runs for each individual iteration:
#include <complex>
using std::complex;

const int max_iterations = 1000;   // has to be 1000 to get decent image quality

int getIterations(double x, double y)   // x and y are coordinates
{
    complex<double> z = 0;              // complex numbers, imagine a pair<double>
    complex<double> c(x, y);
    int iterations = 0;
    while (iterations < max_iterations)
    {
        z = z * z + c;
        if (abs(z) > 2)                 // abs(z) = square root of the sum of squares of both elements
        {
            break;
        }
        iterations++;
    }
    return iterations;
}
I don't know exactly how your abs(z) works, but based on your description it might be slowing your program down by a lot.
From what you describe, you are taking the sum of the squares of both elements of your complex number and then taking its square root. Whatever your square root method is, it probably takes more than just a few instructions to run.
Instead, just compare complex.x * complex.x + complex.y * complex.y > 4; it's definitely faster than taking the square root first and then comparing it with 2.
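Here is a rough sketch of that change applied to the loop from your function (std::complex has no .x/.y members, so real() and imag() are used; std::norm(z) would also work):

// inside getIterations, replacing the abs(z) > 2 test
while (iterations < max_iterations)
{
    z = z * z + c;
    double re = z.real(), im = z.imag();
    if (re * re + im * im > 4.0)   // same condition as abs(z) > 2, but no sqrt
    {
        break;
    }
    iterations++;
}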
Is there a reason the above has to be done at run-time?
I mean: the result of this loop seems to depend only on "x" and "y" (which are just coordinates), so you could try to constexpr all of these calculations so they are done at compile time, pre-building a map of results...
At the very least, try to build that map once during run-time initialisation.
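A rough sketch of the run-time initialisation idea (the pixel-to-coordinate mapping and the plane bounds below are just illustrative assumptions, not taken from your code):

#include <vector>

int getIterations(double x, double y);        // defined as in the question

std::vector<int> iterationTable(800 * 600);   // one entry per pixel

void buildIterationTable()
{
    for (int py = 0; py < 600; ++py)
    {
        for (int px = 0; px < 800; ++px)
        {
            // illustrative mapping from pixel to plane coordinates
            double x = -2.5 + 3.5 * px / 800.0;
            double y = -1.0 + 2.0 * py / 600.0;
            iterationTable[py * 800 + px] = getIterations(x, y);
        }
    }
}

// build once at start-up, then read iterationTable[py * 800 + px] as needed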
I'm trying to construct and compare, in the fastest possible way, two random 0-1 vectors of the same length using Julia, each vector with the same number of zeros and ones.
This is all for a Monte Carlo simulation of the following probabilistic question:
We have two independent urns, each one with n white balls and n black balls. We repeatedly take a pair of balls, one from each urn, until the urns are empty. What is the probability that every pair has the same color?
What I did is the following:
using Random
# Auxiliary function that compares the parity, element by element, of two
# random vectors of length 2n
function comp(n::Int64)
sum((shuffle!(Vector(1:2*n)) .+ shuffle!(Vector(1:2*n))).%2)
end
The above generates two random permutations of the vector 1 to 2n, adds them element by element, applies modulo 2 to each element, and then sums all the values of the resulting vector. I'm using the parity of each number to model its color: odd is black and even is white.
If the final sum is zero, then the two random vectors had the same colors element by element. Any other result means that the two vectors did not have matching colors in every position.
Then I set up the following function, which is just the Monte Carlo simulation of the desired probability:
# Here m is an optional argument that controls the number of random
# experiments in the simulation
function sim(n::Int64,m::Int64=24)
# A counter for the valid cases
x = 0
for i in 1:2^m
# A random pair of vectors is a valid case if they have the
# same parity element by element, so
if comp(n) == 0
x += 1
end
end
# The estimated value
x/2^m
end
Now I want to know if there is a faster way to compare such vectors. I tried the following alternative construction and comparison for the random vectors
shuffle!( repeat([0,1],n)) == shuffle!( repeat([0,1],n))
Then I changed the code of comp(n) accordingly.
With these changes the code runs slightly slower, which I tested with the @time macro. Another change I made was swapping the for statement for a while statement, but the computation time remained the same.
Because I'm not a programmer (indeed, just yesterday I learned a bit of the Julia language and installed the Juno front-end), there is probably a faster way to make the same computations. Any tip would be appreciated, because the effectiveness of a Monte Carlo simulation depends on the number of random experiments, so the faster the computation, the larger the values we can test.
The key cost in this problem is shuffle!, therefore in order to maximize the simulation speed you can use the following (I add it as an answer as it is too long for a comment):
function test(n,m)
ref = [isodd(i) for i in 1:2n]
sum(all(view(shuffle!(ref), 1:n)) for i in 1:m) / m
end
What are the differences from the code proposed in the other answer:
You do not have to shuffle! both vectors; it is enough to shuffle! one of them, because the result of the comparison is invariant under applying the same permutation to both independently shuffled vectors. We can therefore assume that, after the random permutation, one of the vectors is re-sorted so that it has trues in the first n entries and falses in the last n entries.
I do shuffle! in-place (i.e. ref vector is allocated only once)
I use the all function on the first half of the vector; this way the check stops as soon as it hits the first false. If all of the first n entries are true, I do not have to check the last n entries, because I know they are all false.
To get something cleaner, you could directly generate vectors of 0/1 values and then just let Julia check for vector equality, e.g.
function rndvec(n::Int64)
shuffle!(vcat(zeros(Bool,n),ones(Bool,n)))
end
function sim0(n::Int64, m::Int64=24)
sum(rndvec(n) == rndvec(n) for i in 1:2^m) / 2^m
end
Avoiding allocation makes the code faster, as explained by Bogumił Kamiński (and letting Julia make the comparison is faster than his code).
function sim1(n::Int64, m::Int64=24)
vref = vcat(zeros(Bool,n),ones(Bool,n))
vshuffled = vref[:]
sum(shuffle!(vshuffled) == vref for i in 1:2^m) / 2^m
end
To go even faster, use lazy evaluation and a fast exit: if the first element is different, you don't even need to generate the rest of the vectors.
This would make the code much trickier though.
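One possible sketch of that idea (illustrative code with my own function names, not from the answer above): draw one pair of balls at a time without replacement and stop at the first mismatch, so the full vectors are never built.

# Draw pairs lazily and exit on the first mismatched pair.
function comp_lazy(n::Int)
    w1, b1 = n, n                     # white/black balls left in urn 1
    w2, b2 = n, n                     # white/black balls left in urn 2
    for _ in 1:2n
        c1 = rand() < w1 / (w1 + b1)  # true = white drawn from urn 1
        c2 = rand() < w2 / (w2 + b2)  # true = white drawn from urn 2
        c1 == c2 || return false      # fast exit on the first mismatch
        c1 ? (w1 -= 1) : (b1 -= 1)
        c2 ? (w2 -= 1) : (b2 -= 1)
    end
    return true
end

sim_lazy(n::Int, m::Int=24) = sum(comp_lazy(n) for _ in 1:2^m) / 2^m

Since the very first pair already mismatches with probability 1/2, most trials exit after only a few draws.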
It's perhaps not quite in the spirit of the question, but you could also do some more math.
There are binomial(2*n, n) equally likely ways the colors can be arranged in each urn, so the probability that the second urn exactly matches the first is 1 / binomial(2*n, n), and you could therefore just compute
function sim2(n::Int64, m::Int64=24)
nvec = binomial(2*n, n)
sum(rand(1:nvec) == 1 for i in 1:2^m) / 2^m
end
Here are some timings I obtain:
#time show(("sim0", sim0(6, 21)))
#time show(("sim1", sim1(6, 21)))
#time show(("sim2", sim2(6, 21)))
#time test(("test", test(6, 2^21)))
("sim0", 0.0010724067687988281) 4.112159 seconds (12.68 M allocations: 1.131 GiB, 11.47% gc time)
("sim1", 0.0010781288146972656) 0.916075 seconds (19.87 k allocations: 1.092 MiB)
("sim2", 0.0010628700256347656) 0.249432 seconds (23.12 k allocations: 1.258 MiB)
("test", 0.0010166168212890625) 1.180781 seconds (2.14 M allocations: 98.634 MiB, 2.22% gc time)
I want to sort a huge array, say 10^8 entries of type X with at most N different keys, where N is ~10^2. Because I don't know the range or spacing of the elements, count sort is not an option. So my best guess so far is to use a hash map for the counts like so
std::unordered_map< X, unsigned > counts;
for (auto x : input)
counts[x]++;
This works ok-ish and is ~4 times faster than 3-way quicksort, but I'm a nervous person and it's still not fast enough.
I wonder: am I missing something? Can I make better use of the fact that N is known in advance? Or is it possible to tune the hash map to my needs?
EDIT: An additional precondition is that the input sequence is badly sorted and the frequencies of the keys are about the same.
STL implementations are often not perfect in terms of performance (no holy wars, please).
If you know a guaranteed and sensible upper bound on the number of unique elements (N), then you can trivially implement your own hash table of size 2^s >> N. Here is how I usually do it myself:
int size = 1;
while (size < 3 * N) size <<= 1;
//Note: at least 3X size factor, size = power of two
//count = -1 means empty entry
std::vector<std::pair<X, int>> table(size, std::make_pair(X(), -1));
auto GetHash = [size](X val) -> int { return std::hash<X>()(val) & (size-1); };
for (auto x : input) {
int cell = GetHash(x);
bool ok = false;
for (; table[cell].second >= 0; cell = (cell + 1) & (size-1)) {
if (table[cell].first == x) { //match found -> stop
ok = true;
break;
}
}
if (!ok) { //match not found -> add entry on free place
table[cell].first = x;
table[cell].second = 0;
}
table[cell].second++; //increment counter
}
On MSVC2013, it improves time from 0.62 secs to 0.52 secs compared to your code, given that int is used as type X.
Also, we can choose a faster hash function. Note however, that the choice of hash function depends heavily on the properties of the input. Let's take Knuth's multiplicative hash:
auto GetHash = [size](X val) -> int { return (val*2654435761) & (size-1); };
It further improves time to 0.34 secs.
As a conclusion: do you really want to reimplement standard data structures to achieve a 2X speed boost?
Notes: Speedup may be entirely different on another compiler/machine. You may have to do some hacks if your type X is not POD.
Counting sort really would be best, but isn't applicable due to the unknown range and spacing.
The counting also seems easy to parallelize with fork-join, e.g. boost::thread.
You could also try a more efficient, hand-rolled hash map. unordered_map typically uses linked lists (separate chaining) to cope with potentially bad hash functions. The memory overhead of those linked lists may hurt performance if the hash table doesn't fit into the L1 cache. Closed hashing may use less memory. Some hints for optimizing:
Closed Hashing with linear probing and without support for removal
Power-of-two sized hash table, so the index can be computed with a bit mask instead of a modulo (division takes multiple cycles and there is typically only one hardware divider per core)
Low load factor (entries divided by size) to minimize collisions. That's a tradeoff between memory usage and the number of collisions. A load factor over 0.5 should be avoided; a hash table size of 256 seems suitable for 100 entries.
Cheap hash function. You haven't shown the type of X, so perhaps a cheaper hash function could outweigh the cost of a few more collisions.
I would look at storing the counts in a sorted vector: with only about 100 distinct keys, an insertion into the vector happens for roughly 1 in 10^6 input entries. Lookup is a processor-efficient binary search in the vector.
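A rough sketch of that idea (illustrative code, not from the answer): keep (key, count) pairs in a vector sorted by key, binary-search each element with std::lower_bound, and insert only when a key is new.

#include <algorithm>
#include <utility>
#include <vector>

template <typename X>
std::vector<std::pair<X, unsigned>> countSorted(const std::vector<X>& input)
{
    std::vector<std::pair<X, unsigned>> counts;   // kept sorted by key
    for (const X& x : input) {
        auto it = std::lower_bound(counts.begin(), counts.end(), x,
            [](const std::pair<X, unsigned>& p, const X& key) { return p.first < key; });
        if (it != counts.end() && it->first == x)
            ++it->second;                 // common case: key already present
        else
            counts.insert(it, {x, 1u});   // rare: happens at most ~N times
    }
    return counts;                        // already ordered by key
}

With only ~100 distinct keys the vector stays small enough to sit in cache, so the binary search per element is cheap, and the result doubles as the sorted key order.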
I want to detect parabola(s) of the type y^2 = 4a*x in an image of size 512 x 512. I prepared an accumulator array acc of size 512 x 512 x 512 and a matrix corresponding to that image, and applied the Hough transform. This is how I did it:
for x = 1 to 512
  for y = 1 to 512
    if image_matrix(x,y) > 245   // almost white value, so probably on the parabola
    {
      for x1 = 1 to 512
        for y1 = 1 to 512
        {
          calculate 'a' from (y - y1)^2 = 4*a*(x - x1)
          increment acc(x1, y1, a) by 1
        }
    }
// afterwards, find the cell (i, j, k) where acc is maximal:
// x1 = i, y1 = j, a = k
I faced following problems:
1) acc[512][512][512] takes a lot of memory and needs a huge amount of computation. How can I decrease the array size and thus minimize the computation?
2) The maximum-valued entry of acc(i,j,k) does not always give the intended output. Sometimes the second or third maximum, or even the 10th maximum, gives the intended output. I only need approximate values of 'a', 'x1' and 'y1' (not exact values).
Please help me. Is there anything wrong with my concept?
What I'm going to say may only partly answer your question, but it should work.
If you want to find parabolas of this type:
y^2 = 4a*x
then they are parametrized by only one parameter, which is 'a'. Therefore, I don't really understand why you use an accumulator of 3 dimensions.
For sure, if you want to find a parabola with a more general equation like
y = ax^2 + bx + c
or its counterpart in the y direction (replacing x by y), then you will need a 3-dimensional accumulator like in your example.
I think in your case the problem can be solved easily, since you only need a one-dimensional accumulator (you have only one parameter to accumulate: a).
This is what I would suggest:
for every point (x,y) of your image (x=0 exclusive) {
calculate (a = y^2 / 4x )
add + 1 in the corresponding 'a' cell of your accumulator
(eg: a = index of a simple table)
}
for all the cells of your accumulator {
if (cell[idx] > a certain threshold) there is a certain parabola with a = idx
}
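As a concrete illustration, here is a MATLAB-style sketch of that one-dimensional accumulator (assuming image_matrix is the 512 x 512 image from your question; the number of bins, bin width and threshold are just illustrative choices):

a_max  = 200;                          % largest 'a' considered
n_bins = 400;                          % accumulator resolution
bin_w  = a_max / n_bins;
acc    = zeros(1, n_bins);

for x = 1:512
    for y = 1:512
        if image_matrix(x, y) > 245    % probable parabola pixel
            a = y^2 / (4 * x);
            idx = min(n_bins, max(1, round(a / bin_w)));
            acc(idx) = acc(idx) + 1;
        end
    end
end

threshold = 0.5 * max(acc);                  % keep strong peaks only
a_values  = find(acc > threshold) * bin_w;   % recovered 'a' parameters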
I hope it can help you,
Julien,