Set Visual Studio 2010 Express to target 64-bit platforms

I will be upgrading to a paid edition of VS soon, but in the meantime, I'd like to solve something. I know how to edit the project file to specify a 32- or 64-bit target:
<PlatformTarget>anycpu</PlatformTarget>
However, I notice an extreme performance drop when executing 64-bit code, which doesn't make sense, as I'm running Windows 7 Home Premium 64-bit. For example, the following C# code takes over 13 times longer to execute than the 32-bit int equivalent:
int T = Environment.TickCount;
long j = 0;
for (long i = 0; i < 1000000000; i++)
{
    j = i % 1024;
}
MessageBox.Show((Environment.TickCount - T).ToString() + Environment.NewLine +
                j.ToString());
I believe that the 32- and 64-bit variants should execute at the same speed on a 64-bit OS. Is there something I need to configure or install to make VS Express compile this into proper 64-bit code?
I am executing the release exe.
As a note, I can't edit code in debug mode; VS reports that code cannot be edited in 64-bit mode. This is confusing, because the observed speed doesn't match that statement.

There is some discussion of 32- vs. 64-bit performance on Windows here too; it might be helpful: Strange performance behaviour for 64 bit modulo operation
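As the linked question discusses, 64-bit integer division is much slower than 32-bit division on many x86-64 CPUs, and a modulo by a non-constant divisor compiles to a division. When the divisor is a power of two, the modulo of a non-negative value equals a bitwise AND, which sidesteps the division entirely. A minimal C++ sketch (not the OP's C#) to compare the two; the volatile divisor stops the compiler from applying the mask optimization itself:

#include <chrono>
#include <cstdint>
#include <cstdio>

int main()
{
    using clock = std::chrono::steady_clock;
    volatile int64_t divisor = 1024; // volatile, so the compiler cannot replace % with a mask
    volatile int64_t sink = 0;       // volatile, so the loops are not optimized away

    auto t0 = clock::now();
    for (int64_t i = 0; i < 1000000000; i++)
        sink = i % divisor;          // a real 64-bit division each iteration
    auto t1 = clock::now();

    for (int64_t i = 0; i < 1000000000; i++)
        sink = i & 1023;             // same result for non-negative i, no division
    auto t2 = clock::now();

    auto ms = [](auto d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::printf("modulo: %lld ms, mask: %lld ms\n",
                (long long)ms(t1 - t0), (long long)ms(t2 - t1));
    return 0;
}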

Related

OpenMP doesn't utilize all CPUs (dual socket, Windows and Microsoft Visual Studio)

I have a dual socket system with 22 real cores per CPU, or 44 hyperthreads per CPU. I can get OpenMP to completely utilize the first CPU (22 cores/44 hyperthreads), but I cannot get it to utilize the second CPU.
I am using CPUID HWMonitor to check my core usage. The second CPU is always at or near 0% on all cores.
Using:
int nProcessors = omp_get_max_threads();
gets me nProcessors = 44, but I think it's just using the 44 hyperthreads of one CPU instead of the 44 real cores (it should be 88 hyperthreads).
After looking around a lot, I'm not sure how to utilize the other CPU.
My CPU is running fine as I can run other parallel processing programs that utilize all of them.
I'm compiling this in 64-bit, but I don't think that matters. Also, I'm using Visual Studio 2017 Professional version 15.2 with OpenMP 2.0 (the only version VS supports), running on Windows 10 Pro, 64-bit, with two Intel Xeon E5-2699v4 @ 2.2 GHz processors.
So, answering my own question, with thanks to @AlexG for providing some insight. Please see the comments section of the question.
This is a Microsoft Visual Studio and Windows problem.
First read Processor Groups for Windows.
Basically, if you have under 64 logical cores, this is not a problem. Once you get past that, however, Windows splits the logical processors into multiple processor groups (one per socket, or some other organization Windows chooses). In my case, each processor group had 44 hyperthreads and represented one physical CPU socket, and I had exactly two groups. By default, every process (program) is only given access to one processor group, hence I initially could only utilize the 44 threads on one socket. However, if you manually create threads and use SetThreadGroupAffinity to assign a thread to a processor group different from the one your program was initially assigned, then your program becomes a multi-group process. This seems like a roundabout way to enable multiple processor groups, but yes, this is how to do it. A call to GetProcessGroupAffinity will show that the number of groups becomes greater than 1 once you start setting each thread's individual processor group.
I was able to create an OpenMP parallel block like so, and go through and assign processor groups:
...
// The GROUP_AFFINITY structures and the debug buffer are declared inside the
// parallel region so that each thread works on its own private copies.
#pragma omp parallel num_threads( 88 )
{
    GROUP_AFFINITY GroupAffinity1 = { 0 };
    GROUP_AFFINITY GroupAffinity2 = { 0 };
    GROUP_AFFINITY previousAffinity;
    char buf[64];
    HANDLE thread = GetCurrentThread();
    if (omp_get_thread_num() > 32)
    {
        // Reserved has to be zero'd out after each use if reusing the structure...
        GroupAffinity1.Reserved[0] = 0;
        GroupAffinity1.Reserved[1] = 0;
        GroupAffinity1.Reserved[2] = 0;
        GroupAffinity1.Group = 0;
        // Build the mask in 64-bit arithmetic: a plain 1 << n overflows for n >= 31.
        GroupAffinity1.Mask = (KAFFINITY)1 << (omp_get_thread_num() % 32);
        if (SetThreadGroupAffinity(thread, &GroupAffinity1, &previousAffinity))
        {
            sprintf(buf, "Thread set to group 0: %d\n", omp_get_thread_num());
            OutputDebugStringA(buf);
        }
    }
    else
    {
        // Reserved has to be zero'd out after each use if reusing the structure...
        GroupAffinity2.Reserved[0] = 0;
        GroupAffinity2.Reserved[1] = 0;
        GroupAffinity2.Reserved[2] = 0;
        GroupAffinity2.Group = 1;
        GroupAffinity2.Mask = (KAFFINITY)1 << (omp_get_thread_num() % 32);
        if (SetThreadGroupAffinity(thread, &GroupAffinity2, &previousAffinity))
        {
            sprintf(buf, "Thread set to group 1: %d\n", omp_get_thread_num());
            OutputDebugStringA(buf);
        }
    }
}
So with the above code, I was able to force 64 threads to run, 32 threads per socket. I couldn't get over 64 threads, even though I tried forcing omp_set_num_threads to 88. The reason seems to be Visual Studio's implementation of OpenMP, which does not allow more than 64 OpenMP threads. Here's a link on that for more information.
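For reference, a quick way to see how Windows carved the machine into processor groups is to ask the OS directly. A minimal sketch (the APIs require Windows 7 or later); on the dual-socket box above it should report two groups of 44 logical processors each:

#define _WIN32_WINNT 0x0601   // GetActiveProcessorGroupCount needs Windows 7+
#include <windows.h>
#include <stdio.h>

int main(void)
{
    WORD groups = GetActiveProcessorGroupCount();
    printf("Processor groups: %u\n", groups);
    for (WORD g = 0; g < groups; ++g)
        printf("  group %u: %lu logical processors\n",
               g, GetActiveProcessorCount(g));
    printf("Total logical processors: %lu\n",
           GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
    return 0;
}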
Thanks all for helping glean some more tidbits that helped in the eventual answer!

Why are OpenGL and CUDA contexts memory greedy?

I develop software which usually includes both OpenGL and the Nvidia CUDA SDK. Recently, I also started to seek ways to optimize the run-time memory footprint. I noticed the following (Debug and Release builds differ only by 4-7 MB):
Application startup - less than 1 MB total
OpenGL 4.5 context creation (+ GLEW loader init) - 45 MB total
CUDA 8.0 context (Driver API) creation - 114 MB total
If I create the OpenGL context in "headless" mode, the GL context uses 3 MB less, which probably goes to the default framebuffer allocation. That makes sense, as the window size is 640x360.
So after the OpenGL and CUDA contexts are up, the process already consumes 114 MB.
Now, I don't have deep knowledge of the OS-specific work that happens under the hood during GL and CUDA context creation, but 45 MB for GL and 68 MB for CUDA seems like a whole lot to me. I know that usually several megabytes go to system frame buffers and function pointers (probably the bulk of the allocations happens on the driver side). But hitting over 100 MB with just "empty" contexts looks like too much.
I would like to know:
Why GL/CUDA context creation consumes such a considerable amount of memory?
Are there ways to optimize that?
The system setup under test:
Windows 10 64-bit. NVIDIA GTX 960 GPU (driver version 388.31). 8 GB RAM. Visual Studio 2015, 64-bit C++ console project.
I measure memory consumption using the Visual Studio built-in Diagnostic Tools -> Process Memory section.
UPDATE
I tried Process Explorer, as suggested by datenwolf. Here is a screenshot of what I got (my process at the bottom, marked in yellow):
I would appreciate some explanation of that info. I was always looking at "Private Bytes" in the "VS Diagnostic Tools" window. But here I also see "Working Set", "WS Private", etc. Which one correctly shows how much memory my process currently uses? 281,320 K looks way too much because, as I said above, the process at startup does nothing but create the CUDA and OpenGL contexts.
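For what it's worth: "Private Bytes" is the memory committed exclusively to your process, "Working Set" is what is currently resident in physical RAM, including pages shared with other processes (e.g. DLLs), and "WS Private" is the resident subset that is private. For tracking a footprint like this, Private Bytes is usually the number to watch. If you'd rather log the counters from inside the process than eyeball a tool, a small sketch using GetProcessMemoryInfo (link against psapi.lib):

#include <windows.h>
#include <psapi.h>
#include <stdio.h>

int main(void)
{
    PROCESS_MEMORY_COUNTERS_EX pmc = { 0 };
    pmc.cb = sizeof(pmc);
    if (GetProcessMemoryInfo(GetCurrentProcess(),
            (PROCESS_MEMORY_COUNTERS *)&pmc, sizeof(pmc)))
    {
        printf("Private bytes: %zu KB\n", pmc.PrivateUsage / 1024);   // committed, private
        printf("Working set:   %zu KB\n", pmc.WorkingSetSize / 1024); // resident, incl. shared
    }
    return 0;
}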
Partial answer: This is an OS-specific issue; on Linux, CUDA takes 9.3 MB.
I'm using CUDA (not OpenGL) on GNU/Linux:
CUDA version: 10.2.89
OS distribution: Devuan GNU/Linux Beowulf (~= Debian Buster without systemd)
Kernel: Linux 5.2.0
Processor: Intel x86_64
To check how much memory gets used by CUDA when creating a context, I ran the following C program (which also checks what happens after context destruction):
#include <stdio.h>
#include <cuda.h>
#include <malloc.h>
#include <stdlib.h>

static void print_allocation_stats(const char* s)
{
    printf("%s:\n", s);
    printf("--------------------------------------------------\n");
    malloc_stats();
    printf("--------------------------------------------------\n\n");
}

int main()
{
    print_allocation_stats("Initially");
    int status = cuInit(0);
    if (status != 0) { return EXIT_FAILURE; }
    print_allocation_stats("After CUDA driver initialization");
    int device_id = 0;
    unsigned flags = 0;
    CUcontext context_id;
    status = cuCtxCreate(&context_id, flags, device_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context creation");
    status = cuCtxDestroy(context_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context destruction");
    return EXIT_SUCCESS;
}
(Note that malloc_stats() is a glibc-specific function, not part of the standard library.)
Summarizing the results and snipping irrelevant parts:
Point in program                 | Total bytes | In-use  | Max MMAP regions | Max MMAP bytes
Initially                        |      135168 |    1632 |                0 |              0
After CUDA driver initialization |      552960 |  439120 |                2 |         307200
After context creation           |     9314304 | 6858208 |                8 |        6643712
After context destruction        |     7016448 |  580688 |                8 |        6643712
So CUDA starts at 0.5 MB, and after allocating a context it takes up 9.3 MB (going back down to 7.0 MB on destroying the context). 9 MB is still a lot of memory for not having done anything, but perhaps some of it is all-zeros, uninitialized, or copy-on-write, in which case it doesn't really take up that much memory.
It's possible that memory use improved dramatically over the two years between the driver release with CUDA 8 and the one with CUDA 10, but I doubt it. So it looks like your problem is Windows-specific.
Also, I should mention that I did not create an OpenGL context, which is another part of the OP's question, so I haven't estimated how much memory that takes. The OP raises the question of whether the sum is greater than its parts, i.e. whether a CUDA context would take more memory if an OpenGL context existed as well; I believe this should not be the case, but readers are welcome to try and report...

Windows 64-bit and 32-bit incompatibilities

I know that 64-bit applications need 64-bit Windows.
Which C/C++ code will work only on 64-bit, or only on 32-bit?
Edit: I have found it here
Can I determine the process word size at runtime? For example, I could have a 32-bit application that detects whether the OS is 32- or 64-bit and then launches a new sub-process with the right word size.
You can find out if your system is 32-bit or 64-bit with GetNativeSystemInfo. For example, you could do something like this:
typedef void (WINAPI *GetNativeSystemInfo_t)(LPSYSTEM_INFO lpSystemInfo);

BOOL IsSystem64Bit()
{
    HMODULE kernel32 = LoadLibrary(TEXT("kernel32.dll")); // LoadLibrary returns an HMODULE
    SYSTEM_INFO si;
    GetNativeSystemInfo_t GetNativeSystemInfoPtr =
        (GetNativeSystemInfo_t)GetProcAddress(kernel32, "GetNativeSystemInfo");
    if (GetNativeSystemInfoPtr == NULL)
        return FALSE;
    GetNativeSystemInfoPtr(&si);
    return (si.wProcessorArchitecture == PROCESSOR_ARCHITECTURE_AMD64);
}
The reason the function is resolved dynamically is that it doesn't exist on versions of Windows prior to XP. (And on those versions of Windows, we already know the system is not 64-bit.)
I'm not sure about Windows, so obviously this will be of limited helpfulness, but on Linux you can determine the word size at runtime: a long int will be the word size. On 64-bit Linux, long is 64 bits; on 32-bit Linux, it is 32 bits.
So, this seems really stupid and inconsistent, but you could do something like
char ws[3];
sprintf(ws, "%zu", sizeof(long));   // sizeof yields a size_t, so use %zu, not %d
fprintf(stderr, "%s\n", ws);
You can then compare ws with different values to see what the word size is. I'm sure that Windows has a comparable basic type to check.
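One caveat when carrying this over to Windows: 64-bit Windows uses the LLP64 model, where long remains 4 bytes even in a 64-bit process, so sizeof(long) would report 32 bits there. The pointer size is a more portable indicator of the process word size:

#include <stdio.h>

int main(void)
{
    /* sizeof(void *) tracks the process word size on both LP64 (Linux)
       and LLP64 (Windows) models, unlike sizeof(long). */
    printf("%zu-bit process\n", sizeof(void *) * 8);
    return 0;
}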

Qt: get system 32-bit or 64-bit info?

I am writing software using Qt. One of my tasks is to determine whether the Windows OS is 32-bit or 64-bit, and then perform the following operations according to that fact.
However, when I tried QSysInfo::WordSize, it always returned 32, while I was actually running on Windows 7 64-bit.
I also tried
#ifdef _WIN32
    return 32;
#elif _WIN64
    return 64;
#endif
This also returns 32.
Actually, Qt is 32-bit on my system. Is that the problem?
How can I get the actual word size of Windows?
Thanks
I personally would call GetNativeSystemInfo and check the value of the wProcessorArchitecture field.
The _WIN32 and _WIN64 macros are, like all macros, evaluated at compile time. They tell you about the architecture of your executable file rather than the architecture of the system on which the executable runs. That latter information, the information that you want, can only be determined at runtime.
QSysInfo::WordSize only tells you whether the application was compiled as a 32-bit or a 64-bit binary. So, yes, being compiled against a 32-bit Qt will give a word size of 32.
For your case, you might want to check IsWow64Process.
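A sketch of the IsWow64Process route: the check tells a 32-bit process whether it is running under WOW64, i.e. on a 64-bit Windows. The function is resolved dynamically because very old Windows versions do not export it:

#include <windows.h>

typedef BOOL (WINAPI *IsWow64Process_t)(HANDLE, PBOOL);

BOOL IsOS64Bit()
{
#if defined(_WIN64)
    return TRUE;                  // a 64-bit process implies a 64-bit OS
#else
    BOOL isWow64 = FALSE;
    IsWow64Process_t pIsWow64Process = (IsWow64Process_t)
        GetProcAddress(GetModuleHandle(TEXT("kernel32")), "IsWow64Process");
    if (pIsWow64Process != NULL)
        pIsWow64Process(GetCurrentProcess(), &isWow64);
    return isWow64;               // TRUE => 32-bit process on 64-bit Windows
#endif
}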
This should work in any C++ environment, including Qt's, on any system that doesn't use "segment registers" (in other words, one with a properly flat memory space):
uint32_t archwidth = sizeof(int *); // arch width in bytes
uint32_t archbits = 8 * archwidth; // arch width in bits
The mechanism here is:
On a 64-bit architecture (like the Xeon) the CPU will use 8-byte (64-bit) pointers, and so archwidth will return 8; and archbits is then 8*8, or 64.
On a 32-bit architecture (like the 68000) the CPU will use 4 byte (32-bit) pointers, and so archwidth will return 4; and archbits is then 4*8, or 32.
On a 16-bit architecture (like the 6809) the CPU will use 2 byte (16-bit) pointers, and so archwidth will return 2; and archbits is then 2*8, or 16.
You can use Q_PROCESSOR_WORDSIZE (or here). I'm surprised it's not documented because it's not in a private header (QtGlobal) and is quite handy.
It can be preferable for some use cases because it doesn't depend on the processor architecture (e.g. it's defined the same way for x86_64 as well as arm64 and many others).
Example:
#include <QtGlobal>
#include <QDebug>
int main() {
#if Q_PROCESSOR_WORDSIZE == 4
    qDebug() << "32-bit executable";
#elif Q_PROCESSOR_WORDSIZE == 8
    qDebug() << "64-bit executable";
#else
    qDebug() << "Processor with unexpected word size";
#endif
}
or even better:
int main() {
    qDebug() << QStringLiteral("%1-bit executable").arg(Q_PROCESSOR_WORDSIZE * 8);
}

Get number of cores on an XP 64-bit system

Hi,
I wrote a function that should give me the number of cores of a Windows system. It works on all systems except XP 64-bit. Here's the way I get the information:
$objWMIItems = $objWMIService.ExecQuery("SELECT * FROM Win32_Processor")
If (0 == IsObj($objWMIItems)) Then
    ;~ error handling
Else
    For $objElement In $objWMIItems
        $nCoreNumber = $objElement.NumberOfCores
    Next
EndIf
Regarding "NumberOfCores", Microsofts MSDN page tells me "Windows Server 2003, Windows XP, and Windows 2000: This property is not available". Somewhere I read, it is possible with having SP3 installed. I suppose that's true, because it works that way on XP 32 bit systems. But there is no SP3 for XP 64...
Is there another way to get the information?
Thanks
I think it's easiest to read the NUMBER_OF_PROCESSORS environment variable.
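For illustration, reading that variable from C++; keep in mind it counts logical processors, so hyperthreads are included:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // NUMBER_OF_PROCESSORS counts logical processors, so a hyperthreaded
    // core is counted twice.
    const char *n = getenv("NUMBER_OF_PROCESSORS");
    printf("Logical processors: %s\n", n != NULL ? n : "unknown");
    return 0;
}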
Do you want "cores" or "number of logical processors including hyperthreading"? (In other words, do you want to count hyperthreading as a "core")?
In any case, copying my answer from a similar question a while back:
- If you actually need to distinguish between actual cores, chips, and logical processors, the API to call is GetLogicalProcessorInformation (a sketch follows below).
- If you just want to know how many logical processors a machine has (with no differentiation for hyperthreading), use GetSystemInfo.
