The AMD RDNA white paper says:
The RDNA architecture is natively designed for a new narrower wavefront with 32 work-items,
intuitively called wave32, that is optimized for efficient compute. Wave32 offers several
critical advantages for compute and complements the existing graphics-focused wave64
mode.
As far as I know, the wavefront size is 64. Does wave32 mean that we can configure the wavefront size from 64 down to 32?
Is there any code example that uses wave32?
While the RDNA architecture is optimized for wave32, the existing
wave64 mode can be more effective for some applications. To handle
wave64 instructions, the wave controller issues and executes two wave32 instructions, each operating on half of the work-items of the
wave64 instruction. The default way to handle a wave64 instruction is
simply to issue and execute the upper and lower halves of each
instruction back-to-back – conceptually slicing every instruction
horizontally.
https://www.amd.com/system/files/documents/rdna-whitepaper.pdf
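As far as I can tell there is no switch in CUDA/OpenCL/HIP source code to pick wave32 or wave64 yourself; the compiler and driver choose the wave mode per shader or kernel (compute kernels on RDNA are typically compiled as wave32, and I believe the clang/hipcc flag -mwavefrontsize64 can force wave64 on RDNA targets, but check the ROCm documentation for your version). What you can easily do from code is query which wavefront size the runtime reports. Here is a minimal sketch using the HIP runtime API, assuming a ROCm/HIP installation; warpSize is typically reported as 32 on RDNA and 64 on GCN:

```cpp
// Minimal sketch: query the wavefront ("warp") size the HIP runtime reports.
// On RDNA GPUs this is typically 32 (wave32), on GCN GPUs 64 (wave64).
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int deviceCount = 0;
    if (hipGetDeviceCount(&deviceCount) != hipSuccess || deviceCount == 0) {
        std::printf("No HIP devices found\n");
        return 1;
    }
    for (int d = 0; d < deviceCount; ++d) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, d);
        std::printf("Device %d: %s, wavefront size = %d\n",
                    d, prop.name, prop.warpSize);
    }
    return 0;
}
```

Inside device code the built-in warpSize (or the sub-group size in OpenCL) gives the same value, so kernels can be written to work with either 32 or 64 instead of hard-coding one of them.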
An example application is CAS (Contrast Adaptive Sharpening):
AMD’s FidelityFX suite includes a new approach known as Contrast
Adaptive Sharpening (CAS) that uses post-processing compute shaders
to enhance image quality. CAS enhances details at the interior of an
object, while maintaining the smooth gradients created by the
antialiasing as illustrated in Figure 12. It is a full-screen compute
shader and therefore can work with any type of anti-aliasing and is
particularly effective when paired with temporal antialiasing.
CAS is extremely fast, taking just 0.15 milliseconds for a 2560x1440
frame, and benefits from a variety of features in the RDNA
architecture such as packed integer math for address calculations,
packed fp16 math for compute, faster image loads, and wave32.
Recent developments in GPUs (the past few generations) allow them to be programmed. Languages like CUDA, OpenCL, and OpenACC are specific to this hardware. In addition, certain games allow programming shaders, which take part in rendering images in the graphics pipeline. Just as code intended for a CPU can cause unintended execution resulting in a vulnerability, I wonder whether a game or other code intended for a GPU can result in a vulnerability.
The benefit a hacker would get from targeting the GPU is "free" computing power without having to deal with the energy cost. The only practical scenario here is crypto-miner viruses; see this article for example. I don't know the details of how they operate, but the idea is to use the GPU to mine crypto-currencies in the background, since GPUs are much more efficient than CPUs at this. These viruses will cause substantial energy consumption if they go unnoticed.
Regarding an application running on the GPU causing or exploiting a vulnerability, the use cases here are rather limited, since security-relevant data usually is not processed on GPUs.
At most you could deliberately crash the graphics driver and in that way sabotage other programs so they cannot execute properly.
There are already plenty of security mechanisms prohibiting reads of other processes' VRAM and so on, but there is always some way around them.
In this paper: https://arxiv.org/pdf/1609.08144.pdf "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", 2016
On page 12, in Table 1, the listed decoding time for inference with their 2016 neural translation model is almost 3x faster on CPU than on GPU. Their model is highly parallelized across GPUs along the depth axis.
Would anyone have any insight?
Would this also mean that, generally speaking, it is better to perform the test steps of a neural network on the CPU when training on the GPU? And would this also be true for models trained on only one GPU rather than on many?
They used 88 CPU cores and denoted that as "CPU", while only a single GPU was used, so the theoretical peak performance is not that different. Furthermore, the data has to be loaded onto the GPU, which is an overhead that is not needed on a CPU. The combination of these two factors makes the CPU run perform better.
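The copy overhead is easy to make visible if you time it. Below is a rough sketch with the HIP runtime API (CUDA's event API is analogous); the buffer size is a made-up number for illustration, not one taken from the paper. For step-by-step decoding the per-step compute is small, so even fractions of a millisecond of host-device traffic per step add up:

```cpp
// Rough sketch: measure host-to-device copy time for one "batch" of activations.
// The size is illustrative only, not taken from the GNMT paper.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1024 * 1024;               // ~1M floats, ~4 MB
    std::vector<float> host(n, 1.0f);

    float* dev = nullptr;
    hipMalloc(&dev, n * sizeof(float));

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    hipEventRecord(start, 0);
    hipMemcpy(dev, host.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);
    std::printf("H2D copy of %zu bytes took %.3f ms\n", n * sizeof(float), ms);

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipFree(dev);
    return 0;
}
```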
It seems that for specialized tasks a GPU can be 10x or more powerful than a CPU.
Can we make this power more accessible and utilise it for common programming?
Like having a cheap server easily handle millions of connections? Or on-the-fly database analytics? Map/reduce/Hadoop/Storm-like workloads with 10x the throughput? Etc.?
Is there any movement in that direction? Any new programming languages or programming paradigms that will utilise it?
CUDA and OpenCL are good implementations of GPU programming.
GPU programming uses shaders to process input buffers and almost instantly generate result buffers. Shaders are small algorithmic units, mostly working with float values, each carrying its own data context (input buffers and constants) used to produce results. Each shader is isolated from the others during a task, but you can chain them if required.
GPU programming won't be good at handling HTTP requests, since that is mostly a complex sequential process, but it will be amazing for processing, for example, a photo or a neural network.
As soon as you can chunk your data into tiny parallel units, then yes, it can help. The CPU will remain better for complex sequential tasks.
Colonel Thirty Two links to a long and interesting answer about this if you want more information: https://superuser.com/questions/308771/why-are-we-still-using-cpus-instead-of-gpus
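To make the "chunk your data into tiny parallel units" point concrete, here is a minimal sketch using the HIP runtime API (the CUDA version is nearly identical): each GPU thread handles one array element independently, which is exactly the shape of workload that maps well to a GPU.

```cpp
// Minimal data-parallel sketch: each GPU thread brightens one pixel value.
// HIP runtime API; the CUDA equivalent is nearly identical.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void brighten(float* pixels, float gain, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        pixels[i] *= gain;          // one tiny, independent unit of work
    }
}

int main() {
    const size_t n = 1 << 20;       // ~1M pixels
    std::vector<float> host(n, 0.5f);

    float* dev = nullptr;
    hipMalloc(&dev, n * sizeof(float));
    hipMemcpy(dev, host.data(), n * sizeof(float), hipMemcpyHostToDevice);

    const int block = 256;
    const int grid = static_cast<int>((n + block - 1) / block);
    hipLaunchKernelGGL(brighten, dim3(grid), dim3(block), 0, 0, dev, 1.2f, n);

    hipMemcpy(host.data(), dev, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("first pixel after kernel: %f\n", host[0]);

    hipFree(dev);
    return 0;
}
```

Nothing in the kernel depends on any other element; a per-connection HTTP state machine has no such decomposition, which is why the answer above says GPUs are a poor fit there.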
I saw a talk by Keith Adams of Facebook comparing machine learning techniques to tuning code for improved performance in the real world. Are there examples of such automation techniques being applied in real projects?
I know of profile-guided optimizations in certain compilers, and also some techniques JIT compilers use to improve performance, but I am thinking of more fundamental ways to improve code performance that could require changing the code itself and not just code generation. Things like:
Choosing the optimal buffer size in a particular network application, or choosing the right stack size for a particular application (a toy sketch of this is shown below).
Selecting the struct layout in a multi-threaded application that improves local cache performance while reducing false sharing.
Selecting different data structures all together for a certain algorithm.
I read a paper on Halide, an image processing framework that uses genetic algorithms to auto-tune image processing pipelines to improve performance. Examples like this one or any pointers to research would be useful.
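As a toy illustration of the buffer-size bullet above (the workload and the candidate sizes are invented for the example, not taken from any real project), the simplest possible auto-tuner just benchmarks a handful of candidates and keeps the fastest; systems like the Halide autotuner or Remy explore much larger search spaces with smarter (e.g. genetic) strategies:

```cpp
// Toy auto-tuner sketch: benchmark a few candidate buffer sizes for a
// made-up copy workload and keep the fastest.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical workload: copy `src` into `dst` in chunks of `bufSize` bytes.
static void copy_in_chunks(const std::vector<char>& src, std::vector<char>& dst,
                           size_t bufSize) {
    for (size_t off = 0; off < src.size(); off += bufSize) {
        size_t len = std::min(bufSize, src.size() - off);
        std::memcpy(dst.data() + off, src.data() + off, len);
    }
}

int main() {
    const size_t total = 64 * 1024 * 1024;      // 64 MB of dummy data
    std::vector<char> src(total, 'x'), dst(total);

    size_t best = 0;
    double bestMs = 1e30;
    for (size_t bufSize : {4096u, 16384u, 65536u, 262144u, 1048576u}) {
        auto t0 = std::chrono::steady_clock::now();
        copy_in_chunks(src, dst, bufSize);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("buffer %7zu bytes -> %.2f ms\n", bufSize, ms);
        if (ms < bestMs) { bestMs = ms; best = bufSize; }
    }
    std::printf("chosen buffer size: %zu bytes\n", best);
    return 0;
}
```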
Have a look at Remy: http://web.mit.edu/remy/
It uses a kind of genetic-optimization approach to generate congestion-control algorithms for networks, significantly increasing network performance. You specify assumptions about the network being used, and Remy generates the control algorithm to be run on the nodes of that network. The results are impressive: Remy outperforms all optimization techniques developed by humans so far.
FFTW is a widely used software package which uses OCaml to generate optimized C code. This paper has more details on the process: http://vuduc.org/pubs/vuduc2000-fftw-dct.pdf
You might also look into Acovea, a genetic algorithm to optimize compiler flags: http://stderr.org/doc/acovea/html/index.html
Which of these algorithms is output sensitive (in its basic form)?
ray tracing
GPU rendering
splatting
How can we use acceleration methods to make them output sensitive (or more nearly so)?
I think ray tracing and GPU rendering are not output sensitive.
http://en.wikipedia.org/wiki/Output-sensitive_algorithm
For the folks who didn't understand the question, in computer science, an output-sensitive algorithm is an algorithm whose running time depends on the size of the output, instead of or in addition to the size of the input.
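A small, standard example unrelated to rendering may make the definition concrete: reporting every element of a sorted array that falls inside a query range takes O(log n + k) time, where k is the number of elements reported, so the running time depends on the size of the output.

```cpp
// Classic output-sensitive example: range reporting on a sorted array.
// Two binary searches cost O(log n); copying the hits costs O(k), where k is
// the number of reported elements, so the total time is O(log n + k).
#include <algorithm>
#include <cstdio>
#include <vector>

std::vector<int> report_in_range(const std::vector<int>& sorted, int lo, int hi) {
    auto first = std::lower_bound(sorted.begin(), sorted.end(), lo);
    auto last  = std::upper_bound(sorted.begin(), sorted.end(), hi);
    return std::vector<int>(first, last);   // O(k) copy of the output
}

int main() {
    std::vector<int> data = {1, 3, 4, 7, 9, 12, 15, 20, 22, 31};
    for (int v : report_in_range(data, 5, 21)) {
        std::printf("%d ", v);              // prints: 7 9 12 15 20
    }
    std::printf("\n");
    return 0;
}
```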
Ray tracing is output sensitive; in fact, many ray-tracing programs can generate smaller images or movies in less time.
GPU rendering is output sensitive: the GPU can parallelise the task and speed it up, but far fewer computations are required to render a smaller image than a bigger one.
Texture splatting is also output sensitive: since textures are typically repeated, you can generate a huge image by joining many of them, which requires more CPU power (and memory).