Can't get OpenMP to Produce More than One Thread - visual-studio-2010

#include <omp.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    omp_set_num_threads(4);
    printf("numThreads = %d\n", omp_get_num_threads());
}
This code prints:
numThreads = 1
This is compiled in Visual Studio 2010 Ultimate. I have set Project Configuration Properties (All Configurations) -> C/C++ -> Language -> OpenMP Support to Yes (/openmp).
I'm at a loss. I've isolated this issue from a larger project where I'd like to use more than one thread.
Any ideas?

omp_get_num_threads – Size of the active team
Returns the number of threads in the current team. In a sequential section of the program omp_get_num_threads returns 1.
http://gcc.gnu.org/onlinedocs/libgomp/omp_005fget_005fnum_005fthreads.html#omp_005fget_005fnum_005fthreads
This means you have to call this function inside a parallel region to find out how many threads OpenMP is actually using.
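For example, a minimal sketch (same /openmp project setting assumed) that queries the team size both outside and inside a parallel region:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(4);

    // Outside a parallel region the team consists of the master thread only.
    printf("outside parallel: %d\n", omp_get_num_threads());   // prints 1

    #pragma omp parallel
    {
        // Inside the parallel region every thread sees the full team size.
        if (omp_get_thread_num() == 0)
            printf("inside parallel:  %d\n", omp_get_num_threads()); // prints 4
    }
    return 0;
}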

Related

Wanting to understand purpose and count of V8's WorkerThread(s)

I have a very simple program as follows that just creates an isolate, then sleeps:
#include <libplatform/libplatform.h>
#include <v8-platform.h>
#include <v8.h>
#include <stdio.h>
#include <unistd.h>

using v8::Isolate;

int main() {
    std::unique_ptr<v8::Platform> platform = v8::platform::NewDefaultPlatform();
    v8::V8::InitializePlatform(platform.get());
    v8::V8::Initialize();
    v8::Isolate::CreateParams create_params;
    create_params.array_buffer_allocator = v8::ArrayBuffer::Allocator::NewDefaultAllocator();
    Isolate* isolate = v8::Isolate::New(create_params);
    printf("Sleeping...\n");
    usleep(1000 * 1000 * 100);
    printf("Done\n");
    return 0;
}
When I run this program, I can check the number of threads the process has created with ps -T -p <process_id>. On my 8-core machine V8 creates 7 extra threads, all named "V8 WorkerThread", and on my 16-core machine I get 8 instances of this "V8 WorkerThread".
I am looking to understand what determines the number of extra threads V8 spawns and what the purpose of these threads is. Thanks in advance!
The number of worker threads, when not specified by the embedder (that's you!), is chosen based on the number of CPU cores. In the current implementation, the formula is: number_of_worker_threads = (number_of_CPU_cores - 1) up to a maximum of 8, though this may change without notice. You can also specify your own worker thread pool size as an argument to NewDefaultPlatform.
The worker threads are used for various tasks that can be run in the background, mostly for garbage collection and optimized compilation.
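If you want to pin the pool size yourself, here is a minimal sketch (assuming the current NewDefaultPlatform signature, whose first parameter is the thread pool size; 0 means "let V8 decide"):

#include <libplatform/libplatform.h>
#include <v8.h>

int main() {
    // Ask for exactly 2 background worker threads instead of the
    // core-count-based default.
    std::unique_ptr<v8::Platform> platform = v8::platform::NewDefaultPlatform(2);
    v8::V8::InitializePlatform(platform.get());
    v8::V8::Initialize();
    // ... create isolates, run scripts ...
    v8::V8::Dispose();
    v8::V8::ShutdownPlatform();
    return 0;
}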

how to compile opencl project with kernels

I am totally a beginner with OpenCL. I searched around the internet and found some "hello world" demos for OpenCL projects. Usually, in this sort of minimal project, there is a *.cl file containing some kind of OpenCL kernels and a *.c file containing the main function. The question is: how do I compile this kind of project from the command line? I know I should use some sort of -lOpenCL flag on Linux and -framework OpenCL on Mac, but I have no idea how to link the *.cl kernel to my main source file. Thank you for any comments or useful links.
In OpenCL, the .cl files that contain device kernel code are usually compiled and built at run time. This means that somewhere in your host OpenCL program, you have to compile and build your device program before you can use it. This feature enables maximum portability.
Let's consider an example I collected from two books. Below is a very simple OpenCL kernel adding two numbers from two global arrays and saving them in another global array. I save this code in a file named vector_add_kernel.cl.
kernel void vecadd( global int* A, global int* B, global int* C ) {
    const int idx = get_global_id(0);
    C[idx] = A[idx] + B[idx];
}
Below is the host code written in C++ that exploits OpenCL C++ API. I save it in a file named ocl_vector_addition.cpp beside where I saved my .cl file.
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <iterator>
#include <memory>
#include <stdlib.h>

#define __CL_ENABLE_EXCEPTIONS
#if defined(__APPLE__) || defined(__MACOSX)
#include <OpenCL/cl.hpp>
#else
#include <CL/cl.hpp>
#endif
int main( int argc, char** argv ) {
    const int N_ELEMENTS=1024*1024;
    unsigned int platform_id=0, device_id=0;
    try{
        std::unique_ptr<int[]> A(new int[N_ELEMENTS]); // Or you can use simple dynamic arrays like: int* A = new int[N_ELEMENTS];
        std::unique_ptr<int[]> B(new int[N_ELEMENTS]);
        std::unique_ptr<int[]> C(new int[N_ELEMENTS]);
        for( int i = 0; i < N_ELEMENTS; ++i ) {
            A[i] = i;
            B[i] = i;
        }
        // Query for platforms
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        // Get a list of devices on this platform
        std::vector<cl::Device> devices;
        platforms[platform_id].getDevices(CL_DEVICE_TYPE_GPU|CL_DEVICE_TYPE_CPU, &devices); // Select the platform.
        // Create a context
        cl::Context context(devices);
        // Create a command queue
        cl::CommandQueue queue = cl::CommandQueue( context, devices[device_id] ); // Select the device.
        // Create the memory buffers
        cl::Buffer bufferA=cl::Buffer(context, CL_MEM_READ_ONLY, N_ELEMENTS * sizeof(int));
        cl::Buffer bufferB=cl::Buffer(context, CL_MEM_READ_ONLY, N_ELEMENTS * sizeof(int));
        cl::Buffer bufferC=cl::Buffer(context, CL_MEM_WRITE_ONLY, N_ELEMENTS * sizeof(int));
        // Copy the input data to the input buffers using the command queue.
        queue.enqueueWriteBuffer( bufferA, CL_FALSE, 0, N_ELEMENTS * sizeof(int), A.get() );
        queue.enqueueWriteBuffer( bufferB, CL_FALSE, 0, N_ELEMENTS * sizeof(int), B.get() );
        // Read the program source
        std::ifstream sourceFile("vector_add_kernel.cl");
        std::string sourceCode( std::istreambuf_iterator<char>(sourceFile), (std::istreambuf_iterator<char>()));
        cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(), sourceCode.length()));
        // Make program from the source code
        cl::Program program=cl::Program(context, source);
        // Build the program for the devices
        program.build(devices);
        // Make kernel
        cl::Kernel vecadd_kernel(program, "vecadd");
        // Set the kernel arguments
        vecadd_kernel.setArg( 0, bufferA );
        vecadd_kernel.setArg( 1, bufferB );
        vecadd_kernel.setArg( 2, bufferC );
        // Execute the kernel
        cl::NDRange global( N_ELEMENTS );
        cl::NDRange local( 256 );
        queue.enqueueNDRangeKernel( vecadd_kernel, cl::NullRange, global, local );
        // Copy the output data back to the host
        queue.enqueueReadBuffer( bufferC, CL_TRUE, 0, N_ELEMENTS * sizeof(int), C.get() );
        // Verify the result
        bool result=true;
        for (int i=0; i<N_ELEMENTS; i++)
            if (C[i] !=A[i]+B[i]) {
                result=false;
                break;
            }
        if (result)
            std::cout<< "Success!\n";
        else
            std::cout<< "Failed!\n";
    }
    catch(cl::Error err) {
        std::cout << "Error: " << err.what() << "(" << err.err() << ")" << std::endl;
        return( EXIT_FAILURE );
    }
    std::cout << "Done.\n";
    return( EXIT_SUCCESS );
}
I compile this code on a machine with Ubuntu 12.04 like this:
g++ ocl_vector_addition.cpp -lOpenCL -std=c++11 -o ocl_vector_addition.o
It produces an executable named ocl_vector_addition.o which, when I run it, shows a successful output. If you look at the compilation command, you can see that we have not passed anything about our .cl file; we only used the -lOpenCL flag to link the OpenCL library into our program. Also, don't be distracted by the -std=c++11 flag: because I used std::unique_ptr in the host code, I had to use this flag for a successful compile.
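On a Mac the idea is the same except that OpenCL is linked as a framework; a command along these lines (untested here, so treat it as an assumption) should work:

clang++ ocl_vector_addition.cpp -std=c++11 -framework OpenCL -o ocl_vector_addition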
So where is this .cl file used? If you look at the host code, you'll find the four parts that I repeat below, numbered:
// 1. Read the program source
std::ifstream sourceFile("vector_add_kernel.cl");
std::string sourceCode( std::istreambuf_iterator<char>(sourceFile), (std::istreambuf_iterator<char>()));
cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(), sourceCode.length()));
// 2. Make program from the source code
cl::Program program=cl::Program(context, source);
// 3. Build the program for the devices
program.build(devices);
// 4. Make kernel
cl::Kernel vecadd_kernel(program, "vecadd");
In the 1st step, we read the content of the file that holds our device code into a std::string named sourceCode. Then we make a pair of the string and its length and save it to source, which has the type cl::Program::Sources. After we have prepared the code, we create a cl::Program object named program for the context and load the source code into it. The 3rd step is the one in which the OpenCL code gets compiled (and linked) for the device. Since the device code is built in the 3rd step, we can create a kernel object named vecadd_kernel and associate the kernel named vecadd inside it with our cl::Kernel object. This is pretty much the set of steps involved in compiling a .cl file in a program.
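One practical note: if the 3rd step fails (for example because of a typo in the kernel source), the bare exception is not very informative. A small sketch of how the build log can be fetched with the same C++ wrapper, wrapping the program.build(devices) call from the listing above (it relies on __CL_ENABLE_EXCEPTIONS, which we already define):

// Build the program and print the device compiler output on failure.
try {
    program.build(devices);
}
catch (cl::Error& err) {
    if (err.err() == CL_BUILD_PROGRAM_FAILURE) {
        std::string log = program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[device_id]);
        std::cerr << "Build log:\n" << log << std::endl;
    }
    throw; // let the outer handler report the error code
}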
The program I showed creates the device program from the kernel source code. Another option is to use binaries instead. Using a pre-built binary improves application loading time and allows binary distribution of the program, but it limits portability, since a binary that works fine on one device may not work on another. Creating the program from source code at run time is also called online compilation, while creating it from a pre-built binary is called offline compilation (more information here). I skip it here since the answer is already too long.
My answer comes four years late. Nevertheless, I have something to add that complements @Farzad's answer, as follows.
Confusingly, in OpenCL practice, the verb to compile is used to mean two different, incompatible things:
In one usage, to compile means what you already think that it means. It means to build at build-time, as from *.c sources to produce *.o objects for build-time linking.
However, in another usage (and this other usage may be unfamiliar to you), to compile means to translate at run time, from *.cl sources, into GPU machine code.
One happens at build-time. The other happens at run-time.
It might have been less confusing had two different verbs been introduced, but that is not how the terminology has evolved. Conventionally, the verb to compile is used for both.
If unsure, then try this experiment: rename your *.cl file so that your other source files cannot find it, then build.
See? It builds fine, doesn't it?
This is because the *.cl file is not consulted at build time. Only later, when you try to execute the binary executable, does the program fail.
If it helps, you can think of the *.cl file as though it were a data file or a configuration file or even a script. It isn't literally a data file, a configuration file or a script, perhaps, for it does eventually get compiled to a kind of machine code, but the machine code is GPU code and it is not made from the *.cl program text until run-time. Moreover, at run-time, your C compiler as such is not involved. Rather, it is your OpenCL library that does the building.
It took me a fairly long time to straighten these concepts in my mind, mostly because—like you—I had long been familiar with the stages of the C/C++ build cycle; and, therefore, I had thought that I knew what words like to compile meant. Once your mind has the words and concepts straight, the various OpenCL documentation begins to make sense, and you can start work.

Read/Write memory on OS X 10.8.2 with vm_read and vm_write

This is my code that works only on Xcode (version 4.5):
#include <stdio.h>
#include <mach/mach_init.h>
#include <mach/mach_vm.h>
#include <sys/types.h>
#include <mach/mach.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <Security/Authorization.h>

int main(int argc, const char * argv[]) {
    char test[14] = "Hello World! "; //0x7fff5fbff82a
    char value[14] = "Hello Hacker!";
    char test1[14];
    pointer_t buf;
    uint32_t sz;
    task_t task;

    task_for_pid(current_task(), getpid(), &task);

    if (vm_write(current_task(), 0x7fff5fbff82a, (pointer_t)value, 14) == KERN_SUCCESS) {
        printf("%s\n", test);
        //getchar();
    }
    if (vm_read(task, 0x7fff5fbff82a, sizeof(char) * 14, &buf, &sz) == KERN_SUCCESS) {
        memcpy(test1, (const void *)buf, sz);
        printf("%s", test1);
    }
    return 0;
}
I was also trying ptrace and other things, which is why I include the other headers too.
The first problem is that this works only in Xcode. With the debugger I can find the position (memory address) of a variable (in this case test), so I overwrite its string with the one in value and then copy the new contents of test into test1.
I don't completely understand how vm_write works, and the same goes for task_for_pid(). The second problem is that I need to read and write memory in another process; this is only a test to see whether the functions work on the same process, and it does work (only in Xcode).
How can I do that with other processes? I need to read a location (how can I find the address of "something"?); this is the first goal.
For your problems, there are solutions:
The first problem: OS X has address space layout randomization. If you want your memory layout to be fixed and predictable, you have to build your code with the NOPIE setting. The PIE setting (Position Independent Executable) is what enables ASLR, which "slides" the memory by some random value that changes on every instance.
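For example (the exact flags depend on your toolchain, so take this as an assumption rather than a recipe), with clang on OS X you can disable PIE at link time with something like:

clang main.c -o main -Wl,-no_pie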
I actually don't understand how vm_write works (not completely) and the same for task_for_pid():
The Mach APIs operate on the lower-level abstractions of "task" and "thread", which correspond roughly to the BSD "process" and "(u)thread" (there are some exceptions, e.g. kernel_task, which does not have a PID, but let's ignore that for now). task_for_pid obtains the task port (think of it as a "handle"), and if you get the port, you are free to do whatever you wish. Basically, the vm_* functions operate on any task port: you can use them on your own process (mach_task_self(), that is) or on a port obtained from task_for_pid.
Task for PID actually doesn't necessarily require root (i.e. "sudo"). It requires getting past taskgated on OSX, which traditionally verified membership in procmod or procview groups. You can configure taskgated ( /System/Library/LaunchDaemons/com.apple.taskgated.plist) for debugging purposes. Ultimately, btw, getting the task port will require an entitlement (the same as it now does on iOS). That said, the easiest way, rather than mucking around with system authorizations, etc, is to simply become root.
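As an illustration only (a rough sketch, not a drop-in solution: the target PID and address below are placeholders, and the task_for_pid call will fail unless the taskgated/root/entitlement conditions above are met), reading another process's memory looks roughly like this:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <mach/mach.h>
#include <mach/mach_vm.h>

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    pid_t pid = (pid_t)atoi(argv[1]);          // target process id (placeholder)
    mach_vm_address_t addr = 0x100000000ULL;   // address to read (placeholder)

    mach_port_t task;
    kern_return_t kr = task_for_pid(mach_task_self(), pid, &task);
    if (kr != KERN_SUCCESS) {
        fprintf(stderr, "task_for_pid failed: %d\n", kr);
        return 1;
    }

    vm_offset_t data = 0;
    mach_msg_type_number_t count = 0;
    kr = mach_vm_read(task, addr, 14, &data, &count);
    if (kr == KERN_SUCCESS) {
        fwrite((const void*)data, 1, count, stdout);           // dump the bytes we copied
        vm_deallocate(mach_task_self(), data, count);          // release the copied pages
    }
    return 0;
}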
Did you try to run your app with "sudo"?
You can't read/write another app's memory without sudo.

Console textgame.exe works on windows 7, not on vista… WHY?

Hey, so I've made a text game using the PDCurses library and Microsoft operating system tools. Here are my includes; look below for further explanation:
#include <iostream>
#include <time.h> // or "ctime"
#include <stdio.h> // for
#include <cstdlib>
#include <Windows.h>
#include <conio.h>
#include <curses.h>
#include <algorithm>
#include <string>
#include <vector>
#include <sstream>
#include <ctime>
#include <myStopwatch.h> // for keeping times
#include <myMath.h> // numb_digits() and digit_val();
myStopwatch.h/myMath.h include:
#include <stdio.h>
#include <math.h>
#include <tchar.h>
So I've tested the game (which includes a folder containing the .exe and pdcurses.dll) on my computer running Windows 7 and it works great. However, when running it on another computer with Vista or older, my game comes up but immediately ends due to the loss of all the player's lives almost instantaneously... how could this be?
If you would like to see the full source code, go to this Link
Thanks!
In the main game loop, you are not initializing the coll variable before passing it to theScreen.check_collision(). If the player is in no danger, then that function does not update this value. Back in the main loop you don't check the return value from check_collision(), and the program is now making decisions based on whatever uninitialized value was in that variable. Welcome to the wide world of Undefined Behavior.
It is likely the difference you're seeing on different OS's is due to the way the different heap managers initialize memory pages. Even if your player survives for a while, after the first collision, that memory location now holds 'X', which is then never cleared, and while the result is still "undefined", on most architectures, this will result in registering a new collision on each iteration, explaining why your "lives" are vanishing so quickly.
Two things you need to do to fix this:
All code paths through check_collision must write to the 'buff' out parameter. The easiest way to do this is initialize it to 0 in the first line of the function. (Alternatively, if it's intended as an in/out param, then you need to initialize it in the main loop before calling check_collision() )
Make your decision based on the return value of check_collision(), rather than the out parameter. (Or, if that return value really is not important, change the return type of the function to void)
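As a purely hypothetical sketch (the real signature of check_collision isn't shown in the question, so the names below are made up for illustration), the shape of both fixes is:

// Hypothetical illustration of the two fixes; not the game's actual code.
char check_collision(char& out_tile)   // 'out_tile' stands in for the 'buff' out parameter
{
    out_tile = 0;                      // Fix 1: every code path writes the out parameter.
    // ... detect a collision and, if one is found, set out_tile and return it ...
    return out_tile;
}

void game_loop()
{
    char coll = 0;                     // initialize before the call, in case it is an in/out param
    char hit = check_collision(coll);
    if (hit != 0) {                    // Fix 2: decide based on the return value,
        // ... lose a life ...         //        not on a possibly uninitialized local.
    }
}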
Line 23 in string_lines is missing a comma at the end. I don't think this is your whole issue, but it can't be good either.
You didn't say whether you recompiled it separately under each OS (Vista, etc.), and if you did recompile, whether the same version of the compiler was used.
Windows 7 shipped with the Visual C++ 2008 runtime.
Windows Vista shipped with the Visual C++ 2005 runtime.
XP shipped with the Visual C++ 6.0 runtime.
Since you compiled the application in Visual Studio 2010, more than likely it was not compiled to target older operating systems.
Try installing the latest runtimes on the machine you are testing with and if it works after doing that, you know to recompile your project to support older operating systems.
http://www.microsoft.com/download/en/details.aspx?id=5555 x86
http://www.microsoft.com/download/en/details.aspx?id=14632 x64

Critical Sections leaking memory on Vista/Win2008?

It seems that using Critical Sections quite a bit in Vista/Windows Server 2008 leads to the OS not fully regaining the memory.
We found this problem with a Delphi application and it is clearly because of using the CS API. (see this SO question)
Has anyone else seen it with applications developed with other languages (C++, ...)?
The sample code was just initializing 10000000 CS, then deleting them. This works fine in XP/Win2003 but does not release all the peak memory in Vista/Win2008 until the application has ended.
The more you use CS, the more your application retains memory for nothing.
Microsoft have indeed changed the way InitializeCriticalSection works on Vista, Windows Server 2008, and probably also Windows 7.
They added a "feature" that retains some memory used for debug information when you allocate a bunch of CS. The more you allocate, the more memory is retained. It might be asymptotic and eventually flatten out (I'm not fully sold on that one).
To avoid this "feature", you have to use the new API InitializeCriticalSectionEx and pass the flag CRITICAL_SECTION_NO_DEBUG_INFO.
The advantage of this is that it might be faster since, very often, only the spin count will be used without having to actually wait.
The disadvantages are that your old applications can be incompatible, you need to change your code, and it is now platform dependent (you have to check the Windows version to determine which call to use). You also lose the debugging information if you ever need it.
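A minimal sketch of that platform-dependent choice (here done at compile time via _WIN32_WINNT; a run-time check through GetProcAddress would be needed for a single binary that has to run on both old and new Windows):

#include <windows.h>

void init_cs(CRITICAL_SECTION* cs)
{
    // On Vista/Server 2008 and later, skip the debug-info allocation;
    // on older systems fall back to the classic call.
#if _WIN32_WINNT >= 0x0600
    InitializeCriticalSectionEx(cs, 4000 /* spin count */, CRITICAL_SECTION_NO_DEBUG_INFO);
#else
    InitializeCriticalSectionAndSpinCount(cs, 4000);
#endif
}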
Test kit to freeze a Windows Server 2008:
- build this C++ example as CSTest.exe
#include "stdafx.h"
#include "windows.h"
#include <iostream>
using namespace std;
void TestCriticalSections()
{
const unsigned int CS_MAX = 5000000;
CRITICAL_SECTION* csArray = new CRITICAL_SECTION[CS_MAX];
for (unsigned int i = 0; i < CS_MAX; ++i)
InitializeCriticalSection(&csArray[i]);
for (unsigned int i = 0; i < CS_MAX; ++i)
EnterCriticalSection(&csArray[i]);
for (unsigned int i = 0; i < CS_MAX; ++i)
LeaveCriticalSection(&csArray[i]);
for (unsigned int i = 0; i < CS_MAX; ++i)
DeleteCriticalSection(&csArray[i]);
delete [] csArray;
}
int _tmain(int argc, _TCHAR* argv[])
{
TestCriticalSections();
cout << "just hanging around...";
cin.get();
return 0;
}
-...Run this batch file (needs the sleep.exe from server SDK)
@rem you may adapt the sleep delay depending on speed and # of CPUs
@rem sleep 2 on a duo-core 4GB. sleep 1 on a 4CPU 8GB.
@for /L %%i in (1,1,300) do @echo %%i & @start /min CSTest.exe & @sleep 1
@echo still alive?
@pause
@taskkill /im cstest.* /f
-...and see a Win2008 server with 8GB and quad CPU core freezing before reaching the 300 instances launched.
-...repeat on a Windows 2003 server and see it handle it like a charm.
Your test is most probably not representative of the problem. Critical sections are considered "lightweight mutexes" because a real kernel mutex is not created when you initialize the critical section. This means your 10M critical sections are just structs with a few simple members. However, when two threads access a CS at the same time, in order to synchronize them a mutex is indeed created - and that's a different story.
I assume that in your real app threads do collide, as opposed to your test app. Now, if you're really treating critical sections as lightweight mutexes and create a lot of them, your app might be allocating a large number of real kernel mutexes, which are far heavier than the light critical section object. And since mutexes are kernel objects, creating an excessive number of them can really hurt the OS.
If this is indeed the case, you should reduce the use of critical sections where you expect a lot of collisions. This has nothing to do with the Windows version, so my guess might be wrong, but it's still something to consider. Try monitoring the OS handle count, and see how your app is doing.
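A trivial sketch of that kind of monitoring (assuming you just want a periodic printout from inside the process itself):

#include <windows.h>
#include <stdio.h>

// Print the number of kernel handles this process currently holds;
// call it periodically to see whether mutex-backed critical sections pile up.
void print_handle_count()
{
    DWORD count = 0;
    if (GetProcessHandleCount(GetCurrentProcess(), &count))
        printf("handle count: %lu\n", (unsigned long)count);
}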
You're seeing something else.
I just built & ran this test code. Every memory usage stat is constant - private bytes, working set, commit, and so on.
int _tmain(int argc, _TCHAR* argv[])
{
    while (true)
    {
        CRITICAL_SECTION* cs = new CRITICAL_SECTION[1000000];
        for (int i = 0; i < 1000000; i++) InitializeCriticalSection(&cs[i]);
        for (int i = 0; i < 1000000; i++) DeleteCriticalSection(&cs[i]);
        delete [] cs;
    }
    return 0;
}
