x64 allows less threads per block than Win32?

x64 allows less threads per block than Win32? - windows

When I am executing some cuda kernel, I noticed that for the many of my own cuda kernels, x64 build would cause failure, whereas Win32 would not.
I am very confused because the cuda source code are the same, and build is fine. It is just when x64 executes, it says it requests too much resource to launch. But shouldn't x64 allows more resources than Win32 in conceptually?
I normally like to use 1024 threads per block if it is possible. So to make x64 code work, I have to downsize the block to 256.
Any one has any idea?

Yes, it's possible. Presumably the issue you are talking about is a registers-per-thread issue.
In 32-bit mode, all pointers are 32-bits and require only one 32-bit register for storage on the GPU. With the exact same source code, those pointers will require 64-bits for storage and therefore will effectively require two 32-bit registers (and, as #njuffa points out below, certain other types can change their size as well, requiring double the registers.) The number of available 32-bit registers is a hardware limit that does not change whether compiling for 32-bit or 64-bit mode, but pointer storage will use twice as many registers in 64-bit mode.
Pointer arithmetic (or arithmetic involving any of the types that increase in size) may likewise be impacted, as some of it may need to be done using 64-bit arithmetic vs. 32-bit arithmetic.
If these registers-per-thread increases in 64-bit mode place your overall usage over the limit, then you will have to use one of a variety of methods to manage it. You've mentioned one already: reduce the number of threads. You can also investigate the nvcc -maxrregcount ... switch, and/or the launch bounds directive.

Related

Is there performance advantage to ARM64

Recently 64-bit ARM mobiles started appearing. But is there any practical advantage to building an application 64-bit? Specifically considering application that does not have much use for the increased virtual address space¹, but would waste some space due to increased pointer size.
So does ARM64 have any other advantages than the larger address that would actually warrant building such application 64bit?
Note: I've seen 64-bit Performance Advantages, but it only mentions x86-64 which does have other improvements besides extended virtual address space. I also recall that the situation is indeed specific to x86 and on some other platforms that went 64-bit like Sparc the usual approach was to only compile kernel and the applications that actually did use lot of memory as 64-bit and everything else as 32-bit.
¹The application is multi-platform and it still needs to be built for and run on devices with as little as 48MiB of memory. Does have some large data that it reads from external storage, but it never needs more than some megabytes of it at once.

I am not sure a general response can be given, but I can provide some examples of differences. There are of course additional differences added in version 8 of the ARM architecture, which apply regardless of target instruction set.
Performance-positive additions in AArch64
32 General-purpose registers gives compilers more wiggle room.
I/D cache synchronization mechanisms accessible from user mode (no system call needed).
Load/Store-Pair instructions makes it possible to load 128-bits of data with one instruction, and still remain RISC-like.
The removal of near-universal conditional execution makes more out-of-ordering possible.
The change in layout of NEON registers (D0 is still lower half of Q0, but D1 is now lower half of Q1 rather than upper half of Q0) makes more out-of-ordering possible.
64-bit pointers make pointer tagging possible.
CSEL enables all kind of crazy optimizations.
Performance-negative changes in AArch64
More registers may also mean higher pressure on the stack.
Larger pointers mean larger memory footprint.
Removal of near-universal conditional execution may cause higher pressure on branch predictor.
Removal of load/store-multiple means more instructions needed for function entry/exit.
Performance-relevant changes in ARMv8-A
Load-Aquire/Store-Release semantics remove need for explicit memory barriers for basic synchronization operations.
I probably forgot lots of things, but those are some of the more obvious changes.

32-bit pointers with the x86-64 ISA: why not?

The x86-64 instruction set adds more registers and other improvements to help streamline executable code. However, in many applications the increased pointer size is a burden. The extra, unused bytes in every pointer clog up the cache and might even overflow RAM. GCC, for example, builds with the -m32 flag, and I assume this is the reason.
It's possible to load a 32-bit value and treat it as a pointer. This doesn't necessitate extra instructions, just load/compute the 32 bits and load from the resulting address. The trick won't be portable, though, as platforms have different memory maps. On Mac OS X, the entire low 4 GiB of address space is reserved. Still, for one program I wrote, hackishly adding 0x100000000L to 32-bit "addresses" before use improved performance greatly over true 64-bit addresses, or compiling with -m32.
Is there any fundamental impediment to having a 32-bit, x86-64 platform? I suppose that supporting such a chimera would add complexity to any operating system, and anyone wanting that last 20% should just Make it Work™, but it still seems that this would be the best fit for a variety of computationally intensive programs.

There is an ABI called "x32" for linux in development. It's a mix between x86_64 and ia32 similar to what you describe - 32 bit address space while using the full 64 bit register set. It needs a custom kernel, binutils and gcc.
Some SPEC runs indicate a performace improvement of about 30% in some benchmarks. See further information at https://sites.google.com/site/x32abi/

As Mysticial commented above, ICC has the -auto-ilp32 / /Qauto-ilp32 option to use 32-bit pointers in 64-bit mode:
Instructs the compiler to analyze the program to determine if there are 64-bit pointers that can be safely shrunk into 32-bit pointers and if there are 64-bit longs (on Linux* systems) that can be safely shrunk into 32-bit longs.
On Windows there's no x32abi like on Linux, but you can still use 32-bit pointers by disabling the /LARGEADDRESSAWARE flag which is enabled for 64-bit binaries by default
By default, 64-bit Microsoft Windows-based applications have a user-mode address space of several terabytes. For precise values, see Memory Limits for Windows and Windows Server Releases. However, applications can specify that the system should allocate all memory for the application below 2 gigabytes. This feature is beneficial for 64-bit applications if the following conditions are true:
A 2 GB address space is sufficient.
The code has many pointer truncation warnings.
Pointers and integers are freely mixed.
The code has polymorphism using 32-bit data types.
All pointers are still 64-bit pointers, but the system ensures that every memory allocation occurs below the 2 GB limit, so that if the application truncates a pointer, no significant data is lost. Pointers can be truncated to 32-bit values, then extended to 64-bit values by either sign extension or zero extension.
Virtual Address Space
Of course there's no direct compiler support like the -mx32 option in GCC, therefore you may need to deal with pointers manually every time you store a pointer to memory or dereference it. The simplest solution is to write a class wrapping a 32-bit pointer to handle that. Luckily MS also had experience on mixed 32 and 64-bit pointers in the same architecture so they have lots of supporting keywords/macros:
POINTER_32/__ptr32
POINTER_64/__ptr64
POINTER_SIGNED/__sptr
POINTER_UNSIGNED/__uptr
Google's V8 engine uses a different way by compressing pointers to 32 bits to save memory as well as improve performance. See the comparison in memory and performance improvement here
See also How does the compressed pointer implementation in V8 differ from JVM's compressed Oops?
Read more
How to use 32-bit pointers in 64-bit application?
Can a C compiler generate an executable 64-bits where pointers are 32-bits?

I do not expect it very hard to support such a model in the OS. About the only thing that needs to change for processes in this model is page management, pages must be allocated below the 4 GB point. The kernel too should allocate its buffers from the first 4 GBs of the virtual address space if it passes them to the application. The same applies to the loader that loads and starts applications. Other than that a 64-bit kernel should be able handle such apps w/o major modifications.
Compiler support shouldn't be a big issue either. It's mostly a matter of generating code that can use the extra CPU registers and their full 64 bits and adding proper REX prefixes whenever needed.

It's called "x86-32 emulation", or WOW64 on Windows (presumably something else on other OSes) and it's a hardware flag in the processor. No need for any user-mode tricks here.

Where is the bottleneck when using 64-bit variables and 32-bit systems?

A colleague and I recently were arguing about whether or not it is a good idea to use 64-bit variables in 32-bit code. I took a side saying "It might be dangerous and slow us down somewhere", he said "Nah. Nothing bad will happen." So who is right?
The bitness of parts of our ecosystem is as follows:
Windows: either 32-bit or 64-bit
Compiler: 32-bit only (Delphi...)
Processor: modern Intel ones... that's 64-bit, isn't it?
Let's say we have some variables which use some 64-bit type built into the language, with Delphi that would be Int64. Is it safe (performance-wise) to scatter the code with calculations based on these?

It sounds like you are doing some premature optimization.
Even if it would cause a very slight performance hit, the hardware of your users are more likely to have 64-bit machines as the software ages. So it would be a self correcting problem, if it even is a problem.
If the compiler supports them, go ahead and use them. If you find a slowdown in a tight loop with lots of calculations, then perhaps consider changing it.

It's bad. The data takes twice the memory, so it's like running on a processor with half the cache. It also requires 2 instructions to load, 2 instructions to add, and 2 instructions to store a pair of numbers. You can test this with simple programs like a prime number sieve. These will generally run measurably slower using the longer types on a 32bit machine, and even on a 64bit machine when you get larger than the cache. The worst part is that once you sprinkle this stuff all over your code, you'll have a systemic performance problem that will be hard to correct later. And yes, similar is even true of using 32 bit integers where shorts could be used, but only with regard to memory performance - a 32bit CPU can do 32bit math just as fast as 16bit - so don't worry about that (or your spreadsheet will only support 65K rows for example).

GCC (and i doubt your Delphi-compiler is much better) compiles/assembles its long long data-datatype so that it uses the EAX and ECX registers, loading them already takes longer than only loading one of them (if it is actually double the time depends on the cache). It then assembles multiple instructions where you would only need one for e.g. ADD/SUB/IDIV if you were only operating on a single 32-bit value. Then follow the stores for EAX and ECX which again take longer than storing only one of them.

On x64 architecture, if you only need 32 bit wide integers, then using 32 bit wide integers will result in faster code, even when running 64 bit code.

Ask yourself this:
Does this variable ever need to store a value that can be bigger than 32-bit, and then choose from the following:
If this answer is "no", use an unsigned int.
If the answer is "yes", use an unsigned 64-bit int. 64-bit ints aren't very memory expensive and are cheap to add or subtract (and reasonably cheap to multiply divide).
If the answer is "yes, but only on 64-bit systems" (for example the length of a buffer might be > 4GB on 64-bit but never on 32-bit), use a "size_t". This is defined to be 32-bit on 32-bit systems and 64-bit on 64-bit systems.
Write your program to be correct, and let the optimiser / performance tests help you get it fast afterwards. It's more expensive for you to write it fast first and fix it later than for you to write it correct first and make it fast later.

Are 64 bit programs bigger and faster than 32 bit versions?

I suppose I am focussing on x86, but I am generally interested in the move from 32 to 64 bit.
Logically, I can see that constants and pointers, in some cases, will be larger so programs are likely to be larger. And the desire to allocate memory on word boundaries for efficiency would mean more white-space between allocations.
I have also heard that 32 bit mode on the x86 has to flush its cache when context switching due to possible overlapping 4G address spaces.
So, what are the real benefits of 64 bit?
And as a supplementary question, would 128 bit be even better?
Edit:
I have just written my first 32/64 bit program. It makes linked lists/trees of 16 byte (32b version) or 32 byte (64b version) objects and does a lot of printing to stderr - not a really useful program, and not something typical, but it is my first.
Size: 81128(32b) v 83672(64b) - so not much difference
Speed: 17s(32b) v 24s(64b) - running on 32 bit OS (OS-X 10.5.8)
Update:
I note that a new hybrid x32 ABI (Application Binary Interface) is being developed that is 64b but uses 32b pointers. For some tests it results in smaller code and faster execution than either 32b or 64b.
https://sites.google.com/site/x32abi/

I typically see a 30% speed improvement for compute-intensive code on x86-64 compared to x86. This is most likely due to the fact that we have 16 x 64 bit general purpose registers and 16 x SSE registers instead of 8 x 32 bit general purpose registers and 8 x SSE registers. This is with the Intel ICC compiler (11.1) on an x86-64 Linux - results with other compilers (e.g. gcc), or with other operating systems (e.g. Windows), may be different of course.

Unless you need to access more memory that 32b addressing will allow you, the benefits will be small, if any.
When running on 64b CPU, you get the same memory interface no matter if you are running 32b or 64b code (you are using the same cache and same BUS).
While x64 architecture has a few more registers which allows easier optimizations, this is often counteracted by the fact pointers are now larger and using any structures with pointers results in a higher memory traffic. I would estimate the increase in the overall memory usage for a 64b application compared to a 32b one to be around 15-30 %.

Regardless of the benefits, I would suggest that you always compile your program for the system's default word size (32-bit or 64-bit), since if you compile a library as a 32-bit binary and provide it on a 64-bit system, you will force anyone who wants to link with your library to provide their library (and any other library dependencies) as a 32-bit binary, when the 64-bit version is the default available. This can be quite a nuisance for everyone. When in doubt, provide both versions of your library.
As to the practical benefits of 64-bit... the most obvious is that you get a bigger address space, so if mmap a file, you can address more of it at once (and load larger files into memory). Another benefit is that, assuming the compiler does a good job of optimizing, many of your arithmetic operations can be parallelized (for example, placing two pairs of 32-bit numbers in two registers and performing two adds in single add operation), and big number computations will run more quickly. That said, the whole 64-bit vs 32-bit thing won't help you with asymptotic complexity at all, so if you are looking to optimize your code, you should probably be looking at the algorithms rather than the constant factors like this.
EDIT:
Please disregard my statement about the parallelized addition. This is not performed by an ordinary add statement... I was confusing that with some of the vectorized/SSE instructions. A more accurate benefit, aside from the larger address space, is that there are more general purpose registers, which means more local variables can be maintained in the CPU register file, which is much faster to access, than if you place the variables in the program stack (which usually means going out to the L1 cache).

I'm coding a chess engine named foolsmate. The best move extraction using a minimax-based tree search to depth 9 (from a certain position) took:
on Win32 configuration: ~17.0s;
after switching to x64 configuration: ~10.3s;
This is 41% of acceleration!

In addition to having more registers, 64-bit has SSE2 by default. This means that you can indeed perform some calculations in parallel. The SSE extensions had other goodies too. But I guess the main benefit is not having to check for the presence of the extensions. If it's x64, it has SSE2 available. ...If my memory serves me correctly.

In the specific case of x68 to x68_64, the 64 bit program will be about the same size, if not slightly smaller, use a bit more memory, and run faster. Mostly this is because x86_64 doesn't just have 64 bit registers, it also has twice as many. x86 does not have enough registers to make compiled languages as efficient as they could be, so x86 code spends a lot of instructions and memory bandwidth shifting data back and forth between registers and memory. x86_64 has much less of that, and so it takes a little less space and runs faster. Floating point and bit-twiddling vector instructions are also much more efficient in x86_64.
In general, though, 64 bit code is not necessarily any faster, and is usually larger, both for code and memory usage at runtime.

Only justification for moving your application to 64 bit is need for more memory in applications like large databases or ERP applications with at least 100s of concurrent users where 2 GB limit will be exceeded fairly quickly when applications cache for better performance. This is case specially on Windows OS where integer and long is still 32 bit (they have new variable _int64. Only pointers are 64 bit. In fact WOW64 is highly optimised on Windows x64 so that 32 bit applications run with low penalty on 64 bit Windows OS. My experience on Windows x64 is 32 bit application version run 10-15% faster than 64 bit since in former case at least for proprietary memory databases you can use pointer arithmatic for maintaining b-tree (most processor intensive part of database systems). Compuatation intensive applications which require large decimals for highest accuracy not afforded by double on 32-64 bit operating system. These applications can use _int64 in natively instead of software emulation. Of course large disk based databases will also show improvement over 32 bit simply due to ability to use large memory for caching query plans and so on.

Any applications that require CPU usage such as transcoding, display performance and media rendering, whether it be audio or visual, will certainly require (at this point) and benefit from using 64 bit versus 32 bit due to the CPU's ability to deal with the sheer amount of data being thrown at it. It's not so much a question of address space as it is the way the data is being dealt with. A 64 bit processor, given 64 bit code, is going to perform better, especially with mathematically difficult things like transcoding and VoIP data - in fact, any sort of 'math' applications should benefit by the usage of 64 bit CPUs and operating systems. Prove me wrong.

On my machine, same h265 encode works almost twice as fast using virtulDub_x64 (with x64 h265 library) vs virtulDub_x32 (regular x32 h265 library). That's probably because longint (64bits) numbers operations (ie: add) can be done on a single instruction on x64, but on 32bit needs two: add lower part, and then add (with carry) the higher part. So unless integer maths are limited to 32bit integers, most of it will take more time under x32.

Drawbacks of using /LARGEADDRESSAWARE for 32-bit Windows executables?

We need to link one of our executables with this flag as it uses lots of memory.
But why give one EXE file special treatment. Why not standardize on /LARGEADDRESSAWARE?
So the question is: Is there anything wrong with using /LARGEADDRESSAWARE even if you don't need it. Why not use it as standard for all EXE files?

blindly applying the LargeAddressAware flag to your 32bit executable deploys a ticking time bomb!
by setting this flag you are testifying to the OS:
yes, my application (and all DLLs being loaded during runtime) can cope with memory addresses up to 4 GB.
so don't restrict the VAS for the process to 2 GB but unlock the full range (of 4 GB)".
but can you really guarantee?
do you take responsibility for all the system DLLs, microsoft redistributables and 3rd-party modules your process may use?
usually, memory allocation returns virtual addresses in low-to-high order. so, unless your process consumes a lot of memory (or it has a very fragmented virtual address space), it will never use addresses beyond the 2 GB boundary. this is hiding bugs related to high addresses.
if such bugs exist they are hard to identify. they will sporadically show up "sooner or later". it's just a matter of time.
luckily there is an extremely handy system-wide switch built into the windows OS:
for testing purposes use the MEM_TOP_DOWN registry setting.
this forces all memory allocations to go from the top down, instead of the normal bottom up.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management]
"AllocationPreference"=dword:00100000
(this is hex 0x100000. requires windows reboot, of course)
with this switch enabled you will identify issues "sooner" rather than "later".
ideally you'll see them "right from the beginning".
side note: for first analysis i strongly recommend the tool VMmap (SysInternals).
conclusions:
when applying the LAA flag to your 32bit executable it is mandatory to fully test it on a x64 OS with the TopDown AllocationPreference switch set.
for issues in your own code you may be able to fix them.
just to name one very obvious example: use unsigned integers instead of signed integers for memory pointers.
when encountering issues with 3rd-party modules you need to ask the author to fix his bugs. unless this is done you better remove the LargeAddressAware flag from your executable.
a note on testing:
the MemTopDown registry switch is not achieving the desired results for unit tests that are executed by a "test runner" that itself is not LAA enabled.
see: Unit Testing for x86 LargeAddressAware compatibility
PS:
also very "related" and quite interesting is the migration from 32bit code to 64bit.
for examples see:
As a programmer, what do I need to worry about when moving to 64-bit windows?
https://www.sec.cs.tu-bs.de/pubs/2016-ccs.pdf (twice the bits, twice the trouble)

Because lots of legacy code is written with the expectation that "negative" pointers are invalid. Anything in the top two Gb of a 32bit process has the msb set.
As such, its far easier for Microsoft to play it safe, and require applications that (a) need the full 4Gb and (b) have been developed and tested in a large memory scenario, to simply set the flag.
It's not - as you have noticed - that hard.
Raymond Chen - in his blog The Old New Thing - covers the issues with turning it on for all (32bit) applications.

No, "legacy code" in this context (C/C++) is not exclusively code that plays ugly tricks with the MSB of pointers.
It also includes all the code that uses 'int' to store the difference between two pointer, or the length of a memory area, instead of using the correct type 'size_t' : 'int' being signed has 31 bits, and can not handle a value of more than 2 Gb.
A way to cure a good part of your code is to go over it and correct all of those innocuous "mixing signed and unsigned" warnings. It should do a good part of the job, at least if you haven't defined function where an argument of type int is actually a memory length.
However that "legacy code" will probably apparently work right for quite a while, even if you correct nothing.
You'll only break when you'll allocate more than 2 Gb in one block. Or when you'll compare two unrelated pointers that are more than 2 Gb away from each other.
As comparing unrelated pointers is technically an undefined behaviour anyway, you won't encounter that much code that does it (but you can never be sure).
And very frequently even if in total you need more than 2Gb, your program actually never makes single allocations that are larger than that. In fact in Windows, even with LARGEADDRESSAWARE you won't be able by default to allocate that much given the way the memory is organized. You'd need to shuffle the system DLL around to get a continuous block of more than 2Gb
But Murphy's laws says that kind of code will breaks one day, it's just that it will happen very long after you've enable LARGEADDRESSAWARE without checking, and when nobody will remember this has been done.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio