Recently, I was intrigued by Julia, as it claims to be a dynamic language with near-C performance. However, my experience with it so far has not been good (at least performance-wise).
The application I'm writing requires random access to specific array indices, whose values are then compared against other specific array indices (over many iterations). The following code simulates what my program needs:
My Julia code finishes executing in around 8 seconds, while the JavaScript code requires less than 1 second in Chrome!
Am I doing something wrong with the Julia code? Thanks a lot in advance.
Julia code here:
n = 5000
x = rand(n)
y = rand(n)
mn = ones(n) * 1000
@time for i in 1:n
    for j in 1:n
        c = abs(x[j] - y[i])
        if c < mn[i]
            mn[i] = c
        end
    end
end
JavaScript code (more than 8 times faster than the Julia code above!):
n = 5000; x = []; y = []; mn = [];
for (var i = 0; i < n; i++) { x.push(Math.random()); }
for (var i = 0; i < n; i++) { y.push(Math.random()); }
for (var i = 0; i < n; i++) { mn.push(1000); }
console.time('test');
for (var i = 0; i < n; i++) {
    for (var j = 0; j < n; j++) {
        var c = Math.abs(x[j] - y[i]);
        if (c < mn[i]) {
            mn[i] = c;
        }
    }
}
console.timeEnd('test');
Performance Tips
Avoid global variables
A global variable might have its value, and therefore its type, change at any point. This makes it difficult for the compiler to optimize code using global variables. Variables should be local, or passed as arguments to functions, whenever possible.
Any code that is performance critical or being benchmarked should be inside a function.
We find that global names are frequently constants, and declaring them as such greatly improves performance:
julia> const n = 5000; const x, y = rand(n), rand(n); const mn = fill(1000.0, n);
julia> function foo!(mn, x, y)
           n = length(mn)
           @inbounds for i in 1:n, j in 1:n
               c = abs(x[j] - y[i])
               if c < mn[i]
                   mn[i] = c
               end
           end
           return mn
       end
foo! (generic function with 1 method)
julia> using BenchmarkTools: @btime
julia> @btime foo!(mn, x, y)
15.432 ms (0 allocations: 0 bytes)
Julia code should always be inside a function so that the compiler can optimize it. Also, you are measuring both compile time and execution time: to get an accurate measurement, call the function twice (the first call triggers compilation).
function test(n)
    x = rand(n)
    y = rand(n)
    mn = ones(n) * 1000
    for i in 1:n
        for j in 1:n
            c = abs(x[j] - y[i])
            if c < mn[i]
                mn[i] = c
            end
        end
    end
    (mn, x, y)
end
test(1)
@time test(5000);
This takes 0.04 s on my laptop; the JavaScript takes 1 s in Chromium (and 53 s in the Firefox web console).
Related
So, I'm trying to create a random vector (think geometry, not an expandable array), and every time I call my random vector function I get the same x value, though y and z are different.
int main () {
    srand((unsigned)time(NULL));
    Vector<double> a;
    a.randvec();
    cout << a << endl;
    return 0;
}
using the function
// random Vector
template <class T>
void Vector<T>::randvec()
{
    const int min = -10, max = 10;
    int randx, randy, randz;
    const int bucket_size = RAND_MAX / (max - min);
    do randx = (rand() / bucket_size) + min;
    while (randx <= min && randx >= max);
    x = randx;
    do randy = (rand() / bucket_size) + min;
    while (randy <= min && randy >= max);
    y = randy;
    do randz = (rand() / bucket_size) + min;
    while (randz <= min && randz >= max);
    z = randz;
}
For some reason, randx will consistently return 8, whereas the other numbers seem to be following the (pseudo) randomness perfectly. However, if I put the call to define, say, randy before randx, randy will always return 8.
Why is my first random number always 8? Am I seeding incorrectly?
The issue is that the random number generator is being seeded with values that are very close together: each run of the program changes the return value of time() only by a small amount, maybe 1 second, maybe not at all! The rather poor standard random number generator then uses these similar seed values to generate apparently identical initial random numbers. Basically, you need a better initial seed generator than time() and a better random number generator than rand().
The looping algorithm used is, I think, lifted from Accelerated C++ and is intended to produce a better spread of numbers over the required range than, say, using the mod operator would. But it can't compensate for always being (effectively) given the same seed.
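As a rough sketch of what "better" looks like in modern C++ (this assumes C++11's <random> header; none of this is in the original code):
#include <iostream>
#include <random>

int main()
{
    // std::random_device gives a non-deterministic seed on most platforms,
    // so two runs started within the same second still diverge immediately.
    std::random_device rd;
    std::mt19937 gen(rd());

    // uniform_int_distribution maps the generator's output onto [min, max]
    // without modulo bias and without hand-rolled bucket arithmetic.
    std::uniform_int_distribution<int> dist(-10, 10);

    std::cout << dist(gen) << ' ' << dist(gen) << ' ' << dist(gen) << '\n';
    return 0;
}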
I don't see any problem with your srand(), and when I tried running extremely similar code, I did not repeatedly get the same number with the first rand(). However, I did notice another possible issue.
do randx = (rand()/bucket_size)+min;
while (randx <= min && randx >= max);
This line probably does not do what you intended. As long as min < max (and it always should be), it's impossible for randx to be both less than or equal to min and greater than or equal to max, so the loop body runs exactly once. Plus, you don't need to loop at all; you can get a value between min and max using:
randx = rand() % (max - min) + min;
I had exactly the same problem. I fixed it by moving the srand() call so that it was only called once in my program (previously I had been seeding it at the top of a function call).
I don't really understand the technicalities, but it solved the problem.
Also, you can even get rid of that strange bucket_size variable and use the following method to generate numbers from a to b inclusive:
srand ((unsigned)time(NULL));
const int a = -1;
const int b = 1;
int x = rand() % ((b - a) + 1) + a;
int y = rand() % ((b - a) + 1) + a;
int z = rand() % ((b - a) + 1) + a;
A simple quickfix is to call rand a few times after seeding.
int main ()
{
    srand((unsigned)time(NULL));
    rand(); rand(); rand();
    Vector<double> a;
    a.randvec();
    cout << a << endl;
    return 0;
}
Just to explain better, the first call to rand() in four sequential runs of a test program gave the following output:
27592
27595
27598
27602
Notice how similar they are? For example, if you divide rand() by 100, you will get the same number 3 times in a row. Now take a look at the second result of rand() in four sequential runs:
11520
22268
248
10997
This looks much better, doesn't it? I really don't see any reason for the downvotes.
Your implementation, through integer division, ignores the lowest 4-5 bits of the random number. Since your RNG is seeded with the system time, the first value you get out of it will change only (on average) every 20 seconds.
This should work:
randx = (min) + (int) ((max - min) * rand() / (RAND_MAX + 1.0));
where
rand() / (RAND_MAX + 1.0)
is a random double value in [0, 1) and the rest is just shifting it around.
Not directly related to the code in this question, but I had the same issue: using srand((unsigned)time(NULL)) and still getting the same sequence of values from subsequent calls to rand().
It turned out that srand needs to be called separately on each thread that uses rand. I had a loading thread that was generating random content (which wasn't random because of the seed issue). I had only called srand in the main thread, not the loading thread. Adding another srand((unsigned)time(NULL)) at the start of the loading thread fixed the issue.
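A minimal sketch of that fix (this assumes a C runtime, such as MSVC's, where rand() keeps per-thread state; the thread function here is hypothetical):
#include <cstdlib>
#include <ctime>
#include <thread>

void loadingThread()
{
    // On runtimes where rand() state is thread-local, every thread that
    // calls rand() must seed its own generator first.
    srand((unsigned)time(NULL));
    // ... generate random content with rand() ...
}

int main()
{
    srand((unsigned)time(NULL));  // seeds the main thread only
    std::thread loader(loadingThread);
    loader.join();
    return 0;
}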
I am trying to parallelize code that looks like this, using OpenCL on a GPU:
int pos = 0;
for (int x = 0; x < x_len; x++) {
    // do some expensive calculation, like
    int value = pow(24, x);
    for (int y = 0; y < y_len; y++) {
        for (int z = 0; z < z_len; z++) {
            // do some expensive calculation that depends on value,
            // but is very unlikely to have a positive result that needs
            // to be saved to global memory - something like
            if ((value * y * z) % 456 == 0) {
                // save position that had positive result
                output[pos] = x*y_len*z_len + y*z_len + z;
                pos++;
            }
        }
    }
}
(y_len and z_len are small enough that a (1, y_len, z_len) local work-group is possible, in case that is important.)
My current solution has two kernels: one performs the outer calculation and saves the result to global memory, and the second uses that data to perform the inner calculation (using atomic_add for pos). That works fine, but my actual code needs to save more data than just one integer (it is actually 2 integers and 2 longs) per x iteration, so global memory is used up quite fast. That means I need to split the kernel calls and iterate over them in host code many times.
So my question is: is there a better way to parallelize this code?
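For reference, the second kernel described above (atomic counter compaction) might look roughly like the following sketch; all names here are hypothetical, not taken from real code:
__kernel void find_hits(__global const int *value,   // per-x results from the first kernel
                        __global int *output,
                        __global int *counter,       // single int, zeroed before launch
                        int y_len, int z_len)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int z = get_global_id(2);

    if ((value[x] * y * z) % 456 == 0) {
        // atomic_inc returns the old value, so each hit gets a unique slot;
        // atomic_add(counter, 1) would work the same way.
        int pos = atomic_inc(counter);
        output[pos] = x * y_len * z_len + y * z_len + z;
    }
}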
I'm using OpenMP to try to get some speedup in a small kernel. It's basically just querying a vector of unordered_sets for membership. I tried to make an optimization, but surprisingly got a slowdown, and am really curious why.
My first pass was:
vector<unordered_set<uint16_t> > setList = getData();
#pragma omp parallel for default(shared) private(i, j) schedule(dynamic, 50)
for (i = 0; i < size; i++) {
    for (j = 0; j < 500; j++) {
        count = count + setList[i].count(val[j]);
    }
}
Then I thought I could maybe get a speedup by moving the setList[i] subexpression up one level of nesting and saving it in a temporary variable, like so:
#pragma omp parallel for default(shared) private(i, j, currSet) schedule(dynamic, 50)
for (i = 0; i < size; i++) {
    currSet = setList[i];  // note: this assignment copies the whole unordered_set
    for (j = 0; j < 500; j++) {
        count = count + currSet.count(val[j]);
    }
}
I had thought this would maybe save a load on each iteration of the j loop and give a speedup, but it actually SLOWED DOWN by about 3x; by this I mean the entire kernel took about three times as long to run. Thoughts on why this would occur?
Thanks!
Adding up a few integers is really not enough work to warrant starting threads for.
If you forget to add the reduction clause, you'll suffer from true sharing: all threads want to update that count variable at the same time. This makes all cores fight for the cache line containing that variable, which will considerably hurt your performance.
I also noticed that you set the schedule to dynamic. You shouldn't: this workload can already be divided at compile time, so don't specify a schedule.
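A sketch of the first loop with both fixes applied (the reduction clause and the default static schedule); the types of val and size are assumed from context:
#include <cstdint>
#include <unordered_set>
#include <vector>

long countMembership(const std::vector<std::unordered_set<std::uint16_t> > &setList,
                     const std::uint16_t *val, long size)
{
    long count = 0;
    // reduction(+:count) gives every thread a private accumulator and sums
    // them once at the end, so the threads never fight over one cache line.
    #pragma omp parallel for reduction(+:count)
    for (long i = 0; i < size; i++) {
        for (int j = 0; j < 500; j++)
            count += setList[i].count(val[j]);
    }
    return count;
}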
As has already been stated, inter-loop dependencies (threads waiting for data from other threads, or data being accessed by multiple threads in succession) can slow a parallelized program down and should be avoided as a rule of thumb. Built-in constructs like reductions collect the individual results and combine them in an optimized fashion.
Here is a good example of a reduction being used in a case similar to yours, from the University of Utah:
int array[8] = {1, 1, 1, 1, 1, 1, 1, 1};
int sum = 0, i;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < 8; i++) {
    sum += array[i];
}
printf("total %d\n", sum);
source: http://www.eng.utah.edu/~cs4960-01/lecture9.pdf
As an aside: in OpenMP loop constructs the loop iteration counter is private by default, so in both of your examples it is not necessary to declare i private.
see wikipedia: https://en.wikipedia.org/wiki/OpenMP#Data_sharing_attribute_clauses
Data sharing attribute clauses
shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.
private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.
see stack exchange answer here: OpenMP: are local variables automatically private?
I am writing some software that would definitely benefit from OpenMP. I am new to OpenMP, and while testing some very basic code (see below) I noticed that execution times are much longer with OpenMP activated (the #pragma line). Any insight is much appreciated.
int main()
{
    int number = 200;
    int max = 2000000;
    for (int t = 1; t < max; t++)
    {
        double fac = 0.0;
        #pragma omp parallel for reduction(+:fac)
        for (int n = 2; n <= number; n++)
            fac += 1;
    }
    return 0;
}
As currently written, the code enters the parallel region max times. The overhead of entering a parallel region in an OpenMP program is small, but you incur it 2,000,000 times. You don't actually tell us what the run times are, but I can readily believe that this makes them much longer than the serial version's. I suggest you wrap the outer loop in a parallel region, not the inner loop.
Take care when you rewrite your code to ensure that the payload inside the parallel region is significant, and returns some value(s) to the program outside the parallel region. Absent these steps a crafty optimising compiler can determine that a loop returns nothing to the rest of the program and simply optimise it away.
Also insert some timing instructions (use omp_get_wtime), rerun your code and, if matters are still not satisfactory, update your question with the new information you gather.
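For example, a minimal timing harness with omp_get_wtime (which returns wall-clock seconds as a double) looks like this:
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double t0 = omp_get_wtime();    // wall-clock time, in seconds
    /* ... the region being measured ... */
    double t1 = omp_get_wtime();
    printf("elapsed: %f s\n", t1 - t0);
    return 0;
}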
This is an improved version that actually works as intended. It parallelizes the outer loop rather than the inner one. Compiled without OpenMP support it takes 1.49 s; with OpenMP, 0.48 s.
int main()
{
    int number = 200;
    int max = 2000000;
    #pragma omp parallel for
    for (int t = 1; t < max; t++)
    {
        double fac = 0.0;
        for (int n = 2; n <= number; n++)
            fac += 1;
        // note: per the advice above, real code should do something with fac,
        // or an optimizing compiler may remove the loop entirely
    }
    return 0;
}
I have a CUDA program that calls a kernel repeatedly within a for loop. The code computes all rows of a matrix using the values computed in the previous row, until the entire matrix is done; this is basically a dynamic programming algorithm. The code below fills the (i,j) entry of many separate matrices in parallel with the kernel.
for (i = 1; i <= xdim; i++) {
    for (j = 1; j <= ydim; j++) {
        start3time = clock();
        assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z);
        end3time = clock();
        diff = static_cast<double>(end3time - start3time) / (CLOCKS_PER_SEC / 1000);
        printf("Time for i=%d j=%d is %f\n", i, j, diff);
    }
}
The kernel assign5 is straightforward:
__global__ void assign5(float* Z, int i, int j, int x, int y, int z) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    char ch = database[j + id];
    Z[i+id] = (Z[x+id] + Z[y+id] + Z[z+id]) * dev_matrix[i][index[ch - 'A']];
}
My problem is that when I run this program, the time for each i and j is 0 most of the time, but sometimes it is 10 milliseconds. The output looks like:
Time for i=0 j=0 is 0
Time for i=0 j=1 is 0
.
.
Time for i=15 j=21 is 10
Time for i=15 j=22 is 0
.
I don't understand why this is happening; I don't see a race condition between threads. If I add
if(i % 20 == 0) cudaThreadSynchronize();
right after the first loop, then the time for each i and j is mostly 0. But then the time for the sync is sometimes 10 or even 20 ms. It seems like CUDA is performing many operations at low cost and then charging a lot for later ones. Any help would be appreciated.
I think you have a misconception about what a kernel call in CUDA actually does on the host. A kernel launch is non-blocking; it is only added to the device's queue. So if you measure the time before and after the kernel call, the difference has nothing to do with how long the kernel takes to run (it measures only the time it takes to enqueue the launch).
You should add a cudaThreadSynchronize() after every kernel call and before you measure end3time. cudaThreadSynchronize() blocks and returns once all kernels in the queue have finished their work.
This is why
if(i % 20 == 0) cudaThreadSynchronize();
made spikes in your measurements: each synchronization waits for every kernel launch queued since the previous one.
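A sketch of the corrected timing loop from the question, with the synchronization moved inside the measurement (note that cudaThreadSynchronize() is deprecated in newer CUDA toolkits; cudaDeviceSynchronize() behaves the same way here):
for (i = 1; i <= xdim; i++) {
    for (j = 1; j <= ydim; j++) {
        start3time = clock();
        assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z);
        cudaDeviceSynchronize();   // wait until the kernel has actually finished
        end3time = clock();
        diff = static_cast<double>(end3time - start3time) / (CLOCKS_PER_SEC / 1000);
        printf("Time for i=%d j=%d is %f\n", i, j, diff);
    }
}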