Run task on all available cores - Windows HPC Pack 2008 R2 SP4 - job-scheduling

I'm using Windows HPC Pack 2008 R2 SP4 to run an MPI application. I'm having problems getting the Job Scheduler to run the app on all available cores. Here's my code...
// Requires a reference to the Microsoft.Hpc.Scheduler assembly.
using (IScheduler scheduler = new Scheduler())
{
    scheduler.Connect("MyCluster");
    var newJob = scheduler.CreateJob();
    newJob.Name = "My job";
    //newJob.IsExclusive = true;
    var singleTask = newJob.CreateTask();
    singleTask.WorkDirectory = @"C:\MpiWorkspace";
    singleTask.CommandLine = @"mpiexec MyMpiApp.exe";
    newJob.AddTask(singleTask);
    scheduler.SubmitJob(newJob, null, null);
}
Run as above, I get allocated 1 (measly) core out of the 16 available across the two compute nodes in the cluster. The best I can get is by uncommenting the line newJob.IsExclusive = true;, which then allocates me all the cores on one of the compute nodes (8 cores).
If I were running from the command line, I could use the mpiexec switch /np * to use all available cores, but this appears to be overridden by the Job Scheduler.
How do I get the same effect in code? How can I run on all available cores without explicitly declaring a minimum and maximum number of cores for the task?

Not sure if you have solved it; I had a similar problem.
One thing I can think of is that the HPC scheduler overrides your affinity. Go to "Configuration" -> "Configure job scheduler policies and settings" -> "Affinity" and set it to "No Jobs".
You also need to add parameters to mpiexec:
mpiexec -np * -al 2:L -c 8 MyMpiApp.exe
-al sets the affinity
-c sets the number of cores per node
Not sure how to do it when your nodes have different numbers of cores.
Hope this helps

Related

Is there any way to know which processes are running on which core in QNX

My system is running QNX 6.5 and has 4 CPU cores, but I don't know which processes are running on each core. Is there any way to find out in detail?
Thanks in advance
Processes typically run multiple threads (at least one, the main thread), so the thread is the actual running unit, not the process (and core affinity is settable per thread). Thus you'd need to know which core(s) the threads are running on.
There is a "%l" format option that tells you which CPU a particular thread is executing on:
# pidin -F "%b %50h %i %l" -p random
tid thread name                                        cpu
  1 1                                                    0
      Runmask     : 0x0000007f
      Inherit Mask: 0x0000007f
  2 Timer Thread                                         1
      Runmask     : 0x0000007f
      Inherit Mask: 0x0000007f
  3 3                                                    6
      Runmask     : 0x0000007f
      Inherit Mask: 0x0000007f
Above, for the process called "random", we print the thread id, the thread name, and the run/inherit CPU masks; the rightmost column is the CPU each thread is currently running on.
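Since affinity is settable per thread (as noted above), a thread can also pin itself with a runmask. A minimal sketch, assuming QNX Neutrino's ThreadCtl():
#include <sys/neutrino.h>
#include <cstdio>
#include <cstdint>

int main()
{
    // Restrict the calling thread to CPUs 0 and 1 (one runmask bit per core).
    std::uintptr_t runmask = 0x3;
    if (ThreadCtl(_NTO_TCTL_RUNMASK, (void *)runmask) == -1)
        std::perror("ThreadCtl(_NTO_TCTL_RUNMASK)");
    return 0;
}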
The best tooling for analyzing the details of process scheduling in QNX is the "System Analysis Toolkit", which uses the instrumentation features of the QNX kernel to provide a log of every scheduling event and message pass.
For QNX 6.5, the documentation can be found here: http://www.qnx.com/developers/docs/6.5.0SP1.update/index.html#./com.qnx.doc.instr_en_instr/about.html
Got the details by using the below command:
pidin rmasks
which gives the pid, tid, and name of every thread.
From the runmask value we can identify which core it is running on.
For my purposes, per-thread detail is fine.
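If you'd rather decode a runmask programmatically than by eye, it's just a bit test per CPU. A small sketch:
#include <cstdio>

int main()
{
    unsigned long runmask = 0x0000007f; // value as reported by pidin
    for (int cpu = 0; runmask != 0; ++cpu, runmask >>= 1)
        if (runmask & 1)
            std::printf("thread may run on CPU %d\n", cpu);
    return 0;
}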

OpenMP doesn't utilize all CPUs (dual socket, Windows and Microsoft Visual Studio)

I have a dual-socket system with 22 real cores per CPU, or 44 hyperthreads per CPU. I can get OpenMP to completely utilize the first CPU (22 cores / 44 hyperthreads), but I cannot get it to utilize the second CPU.
I am using CPUID HWMonitor to check my core usage. The second CPU is always at or near 0% on all cores.
Using:
int nProcessors = omp_get_max_threads();
gets me nProcessors = 44, but I think it's just using the 44 hyperthreads of one CPU instead of the 44 real cores (it should be 88 hyperthreads).
After looking around a lot, I'm not sure how to utilize the other CPU.
My CPU is running fine, as I can run other parallel-processing programs that utilize all of them.
I'm compiling in 64-bit, but I don't think that matters. I'm using Visual Studio 2017 Professional version 15.2 with OpenMP 2.0 (the only version VS supports), running on Windows 10 Pro, 64-bit, with two Intel Xeon E5-2699 v4 @ 2.2GHz processors.
So, answering my own question, with thanks to @AlexG for providing some insight. Please see the comments section of the question.
This is a Microsoft Visual Studio and Windows problem.
First read Processor Groups for Windows.
Basically, if you have under 64 logical cores, this is not a problem. Once you get past that, however, you will have two processor groups, one for each socket (or whatever other organization Windows chooses). In my case, each processor group had 44 hyperthreads and represented one physical CPU socket, and I had exactly two processor groups. By default, every process (program) is given access to only one processor group, hence I initially could only utilize 44 threads on one socket.
However, if you manually create threads and use SetThreadGroupAffinity to set a thread's processor group to one that is different from your program's initially assigned group, then your program becomes a multi-processor-group process. This seems like a roundabout way to enable multiple processors, but yes, this is how to do it. A call to GetProcessGroupAffinity will show that the number of groups becomes greater than 1 once you start setting each thread's individual processor group.
I was able to create an OpenMP parallel block like so, and go through and assign processor groups:
...
// Assumed declarations (not shown in the original snippet):
//   GROUP_AFFINITY GroupAffinity1 = {}, GroupAffinity2 = {}, previousAffinity;
//   char buf[64];
#pragma omp parallel num_threads( 88 )
{
    HANDLE thread = GetCurrentThread();
    if (omp_get_thread_num() > 32)
    {
        // Reserved has to be zero'd out after each use if reusing the structure...
        GroupAffinity1.Reserved[0] = 0;
        GroupAffinity1.Reserved[1] = 0;
        GroupAffinity1.Reserved[2] = 0;
        GroupAffinity1.Group = 0;
        // Cast before shifting: KAFFINITY is 64-bit, and a plain int shift
        // would overflow for bit 31.
        GroupAffinity1.Mask = (KAFFINITY)1 << (omp_get_thread_num() % 32);
        if (SetThreadGroupAffinity(thread, &GroupAffinity1, &previousAffinity))
        {
            sprintf(buf, "Thread set to group 0: %d\n", omp_get_thread_num());
            OutputDebugString(buf);
        }
    }
    else
    {
        // Reserved has to be zero'd out after each use if reusing the structure...
        GroupAffinity2.Reserved[0] = 0;
        GroupAffinity2.Reserved[1] = 0;
        GroupAffinity2.Reserved[2] = 0;
        GroupAffinity2.Group = 1;
        GroupAffinity2.Mask = (KAFFINITY)1 << (omp_get_thread_num() % 32);
        if (SetThreadGroupAffinity(thread, &GroupAffinity2, &previousAffinity))
        {
            sprintf(buf, "Thread set to group 1: %d\n", omp_get_thread_num());
            OutputDebugString(buf);
        }
    }
}
So with the above code, I was able to force 64 threads to run, 32 per socket. I couldn't get over 64 threads even though I tried forcing omp_set_num_threads to 88. The reason seems to be that Visual Studio's implementation of OpenMP doesn't allow more than 64 OpenMP threads. Here's a link on that for more information.
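As a quick sanity check for the group count mentioned above, a minimal standalone Win32 sketch (the two-call pattern is how the API reports the required buffer size):
#include <windows.h>
#include <cstdio>
#include <vector>

int main()
{
    // First call fails with ERROR_INSUFFICIENT_BUFFER and sets groupCount.
    USHORT groupCount = 0;
    GetProcessGroupAffinity(GetCurrentProcess(), &groupCount, nullptr);

    std::vector<USHORT> groups(groupCount);
    if (GetProcessGroupAffinity(GetCurrentProcess(), &groupCount, groups.data()))
        std::printf("process spans %u processor group(s)\n", groupCount);
    return 0;
}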
Thanks all for helping glean some more tidbits that helped in the eventual answer!

Mesos task resources - CPU & Mem

I use Mesos for batch jobs. Jobs run as Docker containers under the framework. There are 2 slaves running on each VM. The resources for each job were set to:
CPUS - 0.1
MEM - 1G
It's a 4-core machine, and Mesos considered it as 8 cores because there are 2 slaves on each VM. So it tried to overload the VM by submitting too many tasks, literally up to 80 jobs ((4+4)/0.1 = 80), and during peak load the VM used to crash.
I tried changing the CPU value to 0.5 so that the VM would not be overloaded ((4+4)/0.5 = 20). But CPU usage still goes up to 95%. The tasks are not CPU-intensive, so I'm not sure why they consume 95%.
Is it that tasks use the resources regardless of whether they actually require them? Is 0.5 allocated by default, or only up to 0.5 when required?
Having two agents on the same host/VM is an antipattern. If you want to oversubscribe resources, have a look at the Mesos docs at http://mesos.apache.org/documentation/latest/oversubscription/
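Note also that, by default, a task's cpus value acts as a relative share rather than a hard cap, which is consistent with tasks creeping up to 95%. If you want hard CPU limits, CFS quotas can be enabled on the agent; a hedged example (flag names per the Mesos agent docs; the binary is mesos-slave on older releases):
mesos-agent --isolation=cgroups/cpu,cgroups/mem --cgroups_enable_cfs=true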

How to limit the memory that an application can allocate

I need a way to limit the amount of memory that a service may allocate in order to prevent the service from starving the system, similar to the way SQL Server allows you to set "Maximum server memory".
I know SetProcessWorkingSetSize doesn't do exactly what I want, but I'm trying to get it to behave the way that I believe it should. Regardless of the values that I use, my test app's working set is not limited. Further, if I call GetProcessWorkingSetSize immediately afterwards, the values returned are not what I previously specified. Here's the code used by my test app:
var
  MinWorkingSet: SIZE_T;
  MaxWorkingSet: SIZE_T;
begin
  if not SetProcessWorkingSetSize(GetCurrentProcess(), 20, 12800) then
    RaiseLastOSError();
  if GetProcessWorkingSetSize(GetCurrentProcess(), MinWorkingSet, MaxWorkingSet) then
    ShowMessage(Format('%d'#13#10'%d', [MinWorkingSet, MaxWorkingSet]));
end;
No error occurs, but both the Min and Max values returned by GetProcessWorkingSetSize are 81,920.
I tried using SetProcessWorkingSetSizeEx using QUOTA_LIMITS_HARDWS_MAX_ENABLE ($00000004) in the Flags parameter. Unfortunately, SetProcessWorkingSetSizeEx fails with "Code 87. The parameter is incorrect" if I pass anything other than $00000000 in Flags.
I've also pursued using Job Objects to accomplish the same goal. I have memory limits working with Job Objects when launching a child process. However, I need the ability for a service to set its own memory limits rather than depending on a "launching" service to do it. So far, I haven't found a way for a single process to create a job object and then add itself to the job object. This always fails with Access Denied.
Any thoughts or suggestions?
The documentation of the SetProcessWorkingSetSize function says:
dwMinimumWorkingSetSize [in]
... This parameter must be greater than zero but less than or equal to the maximum working set size. The default size is 50 pages (for example, this is 204,800 bytes on systems with a 4K page size). If the value is greater than zero but less than 20 pages, the minimum value is set to 20 pages.
In the case of a 4K page size, the imposed minimum value is 20 * 4096 = 81920 bytes, which is the value you saw. The values are specified in bytes.
To actually limit the memory for your service process, I think it's possible to create a new job (CreateJobObject), set the memory limit (SetInformationJobObject), and assign your current process to the job (AssignProcessToJobObject) in the service's start-up routine.
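A minimal sketch of that sequence (shown here in C/C++ rather than Delphi; the 256 MB limit is just an illustrative value):
#include <windows.h>

// Create an anonymous job object, cap per-process committed memory,
// and put the current process into the job.
bool LimitOwnProcessMemory(SIZE_T limitBytes)
{
    HANDLE job = CreateJobObject(nullptr, nullptr);
    if (job == nullptr)
        return false;

    JOBOBJECT_EXTENDED_LIMIT_INFORMATION info = {};
    info.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
    info.ProcessMemoryLimit = limitBytes;
    if (!SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                 &info, sizeof(info)))
        return false;

    // Pre-Windows 8 this fails with ERROR_ACCESS_DENIED if the process
    // is already in a job (see the caveat below).
    return AssignProcessToJobObject(job, GetCurrentProcess()) != FALSE;
}

// Usage: LimitOwnProcessMemory(256 * 1024 * 1024);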
Unfortunately, on Windows versions before Windows 8 and Windows Server 2012, this won't work if the process already belongs to a job:
Windows 7, Windows Server 2008 R2, Windows XP with SP3, Windows Server 2008, Windows Vista and Windows Server 2003: The process must not already be assigned to a job; if it is, the function fails with ERROR_ACCESS_DENIED. This behavior changed starting in Windows 8 and Windows Server 2012.
If this is your case (i.e. you get ERROR_ACCESS_DENIED on older Windows), check whether the process is already assigned to a job (in which case you're out of luck), but also make sure that it has the required access rights: PROCESS_SET_QUOTA and PROCESS_TERMINATE.

Clicking qt .app vs running .exe in terminal

I have a Qt GUI that spawns a C++11 clang server on OS X 10.8 / Xcode.
It does cryptographic proof-of-work mining of a name (single mining thread).
When I click the .app, the process takes 4 1/2 hours.
When I run the exact same exe inside the .app folder from the terminal, the process takes 30 minutes.
Question: how do I debug this?
Thank you
====================================
Even worse:
The mining server is running in a terminal.
If I start a GUI program that connects to the server and just sends it the "mine" command (over IPC): 4 hours.
If I start a CL-UI that connects to the server and just sends it the "mine" command (over IPC): 30 minutes.
In both cases the server is mining in a tight loop. Corrupt memory? A single CPU is at 100%, as it should be... can't figure it out.
=========
This variable is used without locking...
volatile bool running = true;
Server thread:
fut = std::async(&Commissioner::generateName, &comish, name, m_priv.get_public_key());
Server loop...
nonce_t reset = std::numeric_limits<nonce_t>::max() - 1000;
while (running && hit < target) {
    if (nt.nonce >= reset)
    {
        nt.utc_sec = fc::time_point::now();
        nt.nonce = 0;
    }
    else { ++nt.nonce; }
    hit = difficulty(nt.id());
}
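For what it's worth, volatile provides no synchronization or atomicity guarantees in C++11; the conventional fix for a cross-thread stop flag is an atomic. A sketch:
#include <atomic>

std::atomic<bool> running{true};  // replaces 'volatile bool running'

// miner loop:      while (running.load() && hit < target) { ... }
// stopping thread: running.store(false);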
Evidence is now pointing to deterministic chaotic behavior: it is just very sensitive to initial conditions. The initial condition may be the timestamp data within the object that is hashed during mining.
Mods, please close.
