I have read https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar which details about PTX synchronization function.
It says there are 16 "barrier logical resource", and you can specify which barrier to use with the parameter "a". What is a barrier logical resource?
I have a piece of code from an outside source, which I know works. However, I cannot understand the syntax used inside "asm" and what "memory" does. I assume "name" replaces "%0" and "numThreads" replace "%1", but what is "memory" and what are the colons doing?
__device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");}
In a block of 256 threads, I only want threads 64 ~ 127 to synchronize. Is this possible with barrier.sync
function? ( for an example, say I have a grid of 1 block, block of 256 threads. we split the block into 3 conditional branches s.t. threads 0 ~ 63 go into kernel1, threads 64 ~ 127 go into kernel 2, and threads 128 ~ 255 go into kernel 3. I want threads in kernel 2 to only synchronize among themselves. So if I use the "namedBarrierSync" function defied above: "namedBarrierSync( 1, 64)". Then does it synchronize only threads 64 ~ 127, or threads 0 ~ 63?
I have tested with below code ( assume that gpuAssert is an error checking function defined somewhere in the file ).
Here is the code:
__global__ void test(int num_threads)
{
if (threadIdx.x >= 64 && threadIdx.x < 128)
{
namedBarrierSync(0, num_threads) ;
}
__syncthreads();
}
int main(void)
{
test<<<1, 1, 256>>>(128);
gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE_);
printf("complete\n");
return 1;
}
"barrier logical resource" are the hardware necessary to synchronize threads/warps in a thread block (probably atomic counters etc.). You don't need to know the actual hardware implementation to program them, it is sufficient to know there are 16 instances of them available.
As Robert Crovella has pointed out in your cross-post on the Nvidia forum, the documentation for inline PTX is at https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html.
barrier.sync with a named barrier and thread count of 64 synchronizes the first two warps arriving at the named barrier (for compute capability up to 6.x) or the first 64 threads arriving at the named barrier (for compute capability 7.0 onwards).
Your test only launches a single thread (with 256 bytes of shared memory allocated to it), which makes tests of synchronisation instructions moot. You want to launch the test kernel as test<<<1, 256>>>(128); instead.
Related
When I call memmap in the UEFI shell, I got two different attributes like the following:
Shell> memmap
Type Start End # Pages Attributes
Available 0000000080000000-00000000CFFFFFFF 0000000000050000 000000000000000E
Available 0000002000000000-000000237FFFFFFF 0000000000380000 000000000000000F
The problem is that I cannot use the second region of memory whose attribute marks as 000000000000000F. I've already registered that part of memory to my page table. But, my OS will panic when I convert a physical address from that region to a virtual address.
So, my problem is:
What does the attribute means?
How can I change the attribute so that I can use that part of memory?
The memmap command displays the memory map that is maintained by the UEFI environment by listing the contents of EFI_MEMORY_DESCRIPTOR for each memory region.
typedef struct {
UINT32 Type;
EFI_PHYSICAL_ADDRESS PhysicalStart;
EFI_VIRTUAL_ADDRESS VirtualStart;
UINT64 NumberOfPages;
UINT64 Attribute;
} EFI_MEMORY_DESCRIPTOR;
The Attribute field of a memory region describes the bit mask of capabilities for that memory region, and not necessarily the current settings for that memory region.
//*******************************************************
// Memory Attribute Definitions
//*******************************************************
// These types can be “ORed” together as needed.
#define EFI_MEMORY_UC 0x0000000000000001
#define EFI_MEMORY_WC 0x0000000000000002
#define EFI_MEMORY_WT 0x0000000000000004
#define EFI_MEMORY_WB 0x0000000000000008
#define EFI_MEMORY_UCE 0x0000000000000010
#define EFI_MEMORY_WP 0x0000000000001000
#define EFI_MEMORY_RP 0x0000000000002000
#define EFI_MEMORY_XP 0x0000000000004000
#define EFI_MEMORY_NV 0x0000000000008000
#define EFI_MEMORY_MORE_RELIABLE 0x0000000000010000
#define EFI_MEMORY_RO 0x0000000000020000
#define EFI_MEMORY_SP 0x0000000000040000
#define EFI_MEMORY_CPU_CRYPTO 0x0000000000080000
#define EFI_MEMORY_RUNTIME 0x8000000000000000
Full details about each of these attributes can be found in the UEFI Specification Section 7.2 under GetMemoryMap.
The only difference in attribute value between your two memory regions is EFI_MEMORY_UC, i.e. memory is not cacheable.
I have a problem debugging an stm32f407vet6 board and rust code.
The point of the problem is that GDB ignores breakpoints.
After setting breakpoints and executing the "continue" command in gdb, the program continues to ignore all breakpoints.
The only way to stop the program running is to cause an interrupt using the "ctrl + c" command.
After this command, the board stops its execution on the line currently being executed.
I have tried to set breakpoints on all lines where I can set them, but all the attempts are unsuccessful.
$ openocd
Open On-Chip Debugger 0.10.0 (2020-07-01) [https://github.com/sysprogs/openocd]
Licensed under GNU GPL v2
libusb1 09e75e98b4d9ea7909e8837b7a3f00dda4589dc3
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
Info : auto-selecting first available session transport "hla_swd". To override use 'transport select <transport>'.
Info : The selected transport took over low-level target control. The results might differ compared to plain JTAG/SWD
Info : Listening on port 6666 for tcl connections
Info : Listening on port 4444 for telnet connections
Info : clock speed 2000 kHz
Error: libusb_open() failed with LIBUSB_ERROR_NOT_SUPPORTED
Info : STLINK V2J35S7 (API v2) VID:PID 0483:3748
Info : Target voltage: 6.436364
Info : stm32f4x.cpu: hardware has 6 breakpoints, 4 watchpoints
Info : starting gdb server for stm32f4x.cpu on 3333
Info : Listening on port 3333 for gdb connections
$ arm-none-eabi-gdb -q target\thumbv7em-none-eabihf\debug\test_blink
Reading symbols from target\thumbv7em-none-eabihf\debug\test_blink...
(gdb) target remote :3333
Remote debugging using :3333
0x00004070 in core::ptr::read_volatile (src=0xe000e010) at C:\Users\User\.rustup\toolchains\stable-x86_64-pc-windows-msvc\lib/rustlib/src/rust\src/libcore/ptr/mod.rs:1005
1005 pub unsafe fn read_volatile<T>(src: *const T) -> T {
(gdb) load
Loading section .vector_table, size 0x1a8 lma 0x0
Loading section .text, size 0x47bc lma 0x1a8
Loading section .rodata, size 0xbf0 lma 0x4970
Start address 0x47a2, load size 21844
Transfer rate: 100 KB/sec, 5461 bytes/write.
(gdb) b main
Breakpoint 1 at 0x1f2: file src\main.rs, line 15.
(gdb) continue
Continuing.
Program received signal SIGINT, Interrupt.
0x00001530 in cortex_m::peripheral::syst::<impl cortex_m::peripheral::SYST>::has_wrapped (self=0x1000fc6c)
at C:\Users\User\.cargo\registry\src\github.com-1ecc6299db9ec823\cortex-m-0.6.3\src\peripheral/syst.rs:135
135 pub fn has_wrapped(&mut self) -> bool {
(gdb) bt
#0 0x00001530 in cortex_m::peripheral::syst::<impl cortex_m::peripheral::SYST>::has_wrapped (self=0x1000fc6c)
at C:\Users\User\.cargo\registry\src\github.com-1ecc6299db9ec823\cortex-m-0.6.3\src\peripheral/syst.rs:135
#1 0x00003450 in <stm32f4xx_hal::delay::Delay as embedded_hal::blocking::delay::DelayUs<u32>>::delay_us (self=0x1000fc6c, us=500000)
at C:\Users\User\.cargo\registry\src\github.com-1ecc6299db9ec823\stm32f4xx-hal-0.8.3\src/delay.rs:69
#2 0x0000339e in <stm32f4xx_hal::delay::Delay as embedded_hal::blocking::delay::DelayMs<u32>>::delay_ms (self=0x1000fc6c, ms=500)
at C:\Users\User\.cargo\registry\src\github.com-1ecc6299db9ec823\stm32f4xx-hal-0.8.3\src/delay.rs:32
#3 0x00000318 in test_blink::__cortex_m_rt_main () at src\main.rs:40
#4 0x000001f6 in main () at src\main.rs:15
memory.x file:
MEMORY
{
/* NOTE 1 K = 1 KiBi = 1024 bytes */
/* TODO Adjust these memory regions to match your device memory layout */
/* These values correspond to the LM3S6965, one of the few devices QEMU can emulate */
CCMRAM : ORIGIN = 0x10000000, LENGTH = 64K
RAM : ORIGIN = 0x20000000, LENGTH = 128K
FLASH : ORIGIN = 0x00000000, LENGTH = 512K
}
/* This is where the call stack will be allocated. */
/* The stack is of the full descending type. */
/* You may want to use this variable to locate the call stack and static
variables in different memory regions. Below is shown the default value */
_stack_start = ORIGIN(CCMRAM) + LENGTH(CCMRAM);
/* You can use this symbol to customize the location of the .text section */
/* If omitted the .text section will be placed right after the .vector_table
section */
/* This is required only on microcontrollers that store some configuration right
after the vector table */
/* _stext = ORIGIN(FLASH) + 0x400; */
/* Example of putting non-initialized variables into custom RAM locations. */
/* This assumes you have defined a region RAM2 above, and in the Rust
sources added the attribute `#[link_section = ".ram2bss"]` to the data
you want to place there. */
/* Note that the section will not be zero-initialized by the runtime! */
/* SECTIONS {
.ram2bss (NOLOAD) : ALIGN(4) {
*(.ram2bss);
. = ALIGN(4);
} > RAM2
} INSERT AFTER .bss;
*/
openocd.cfg file:
# Sample OpenOCD configuration for the STM32F3DISCOVERY development board
# Depending on the hardware revision you got you'll have to pick ONE of these
# interfaces. At any time only one interface should be commented out.
# Revision C (newer revision)
source [find interface/stlink.cfg]
# Revision A and B (older revisions)
# source [find interface/stlink-v2.cfg]
source [find target/stm32f4x.cfg]
# use hardware reset, connect under reset
# reset_config none separate
main.rs file:
#![no_main]
#![no_std]
#![allow(unsafe_code)]
// Halt on panic
#[allow(unused_extern_crates)] // NOTE(allow) bug rust-lang/rust#53964
extern crate panic_halt; // panic handler
use cortex_m;
use cortex_m_rt::entry;
use stm32f4xx_hal as hal;
use crate::hal::{prelude::*, stm32};
#[entry]
fn main() -> ! {
if let (Some(dp), Some(cp)) = (
stm32::Peripherals::take(),
cortex_m::peripheral::Peripherals::take(),
) {
let rcc = dp.RCC.constrain();
let clocks = rcc
.cfgr
.sysclk(168.mhz())
.freeze();
let mut delay = hal::delay::Delay::new(cp.SYST, clocks);
let gpioa = dp.GPIOA.split();
let mut l1 = gpioa.pa6.into_push_pull_output();
let mut l2 = gpioa.pa7.into_push_pull_output();
loop {
l1.set_low().unwrap();
l2.set_high().unwrap();
delay.delay_ms(500u32);
l1.set_high().unwrap();
l2.set_low().unwrap();
delay.delay_ms(500u32);
}
}
loop {}
}
Cargo.toml file:
[package]
name = "test_blink"
version = "0.1.0"
authors = ["Alex"]
edition = "2018"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
embedded-hal = "0.2"
nb = "0.1.2"
cortex-m = "0.6"
cortex-m-rt = "0.6"
# Panic behaviour, see https://crates.io/keywords/panic-impl for alternatives
panic-halt = "0.2"
cortex-m-log="0.6.2"
[dependencies.stm32f4xx-hal]
version = "0.8.3"
features = ["rt", "stm32f407"]
I am new to rust embedded and maybe I have done something wrong, but I have already tried all the options I can find on the Internet.
At first I thought it was a problem with the cortex-debug plugin for vscode and even created the issue, but the guys couldn't help me because the problem is obviously not on their side.
Debugging "C" code in cubeIDE works, so I dare to assume that the problem is somewhere in rust--gdb--openocd. Perhaps I am missing something, but unfortunately I cannot find it myself yet.
I would appreciate any resources or ideas to solve this problem.
I'm hoping you checked out this resources:
Discovery - debug
From your screen-grab of arm-none-eabi-gdb it does indeed look it it did not hit the break point.
you should have seen this message afterwards:
Note: automatically using hardware breakpoints for read-only addresses.
Breakpoint 1, main () at ...
Did you compile your source with symbols, and unoptimised?
Your config all looks right to me otherwise.
I am trying to put together and example of coding inline assembly code in a 'C' program. I have not had any success. I am using GCC 4.9.0. As near as I can tell my syntax is correct. However the build blows up with the following errors:
/tmp/ccqC2wtq.s:48: Error: syntax error; found `(', expected `,'
Makefile:51: recipe for target 'all' failed
/tmp/ccqC2wtq.s:48: Error: junk at end of line: `(31)'
/tmp/ccqC2wtq.s:49: Error: syntax error; found `(', expected `,'
/tmp/ccqC2wtq.s:49: Error: junk at end of line: `(9)'
These are related to the input/output/clobbers lines in the code. Anyone have an idea where I went wrong?
asm volatile("li 7, %0\n" // load destination address
"li 8, %1\n" // load source address
"li 9, 100\n" // load count
// save source address for return
"mr 3,7\n"
// prepare for loop
"subi 8,8,1\n"
"subi 9,9,1\n"
// perform copy
"1:\n"
"lbzu 10,2(8)\n"
"stbu 10,2(7)\n"
"subi 9,9,1\n" // Decrement the count
"bne 1b\n" // If zero, we've finished
"blr\n"
: // no outputs
: "m" (array), "m" (stringArray)
: "r7", "r8"
);
It's not clear what you are trying to do with the initial instructions:
li 7, %0
li 8, %1
Do you want to load the address of the variables into those registers? In general, this is not possible because the address is not representable in an immediate operand. The easiest way out is to avoid using r7 and r8, and instead use %0 and %1 directly as the register names. It seems that you want to use these registers as base addresses, so you should use the b constraint:
: "b" (array), "b" (stringArray)
Then GCC will take care of the details of materializing the address in a suitable fashion.
You cannot return using blr from inline assembler because it's not possible to tear down the stack frame GCC created. You also need to double-check the clobbers and make sure that you list all the things you clobber (including condition codes, memory, and overwritten input operands).
I was able to get this working by declaring a pair of pointers and initializing them with the addresses of the arrays. I didn't realize that the addresses wouldn't be available directly. I've used inline assembly very sparsely in the past, usually just
to raise or lower interrupt masks, not to do anything that references variables. I just
had some folks who wanted an example. And the "blr" was leftover when I copied a
snipet of a pure assembly routine to use as a starting point. Thanks for the responses.
The final code piece looks like this:
int main()
{
char * stringArrayPtr;
unsigned char * myArrayPtr;
unsigned char myArray[100];
stringArrayPtr = (char *)&stringArray;
myArrayPtr = myArray;
asm volatile(
"lwz 7,%[myArrayPtr]\n" // load destination address
"lwz 8, %[stringArrayPtr]\n" // load source address
"li 9, 100\n" // load count
"mr 3,7\n" // save source address for return
"subi 8,8,1\n" // prepare for loop
"subi 9,9,1\n"
"1:\n" // perform copy
"lbzu 10,1(8)\n"
"stbu 10,1(7)\n"
"subic. 9,9,1\n" // Decrement the count
"bne 1b\n" // If zero, we've finished
// The outputs / inputs / clobbered list
// No output variables are specified here
// See GCC documentation on extended assembly for details.
:
: [stringArrayPtr] "m" (stringArrayPtr), [myArrayPtr]"m"(myArrayPtr)
: "7","8","9","10"
);
}
I need to parse the XML files generated by Visual Studio's performance profiler, and get information about CPU usage of each function in my program (similar to what Visual Studio displays in the diagnostics tools and performance profiler).
I tried looking through all the XML files that I can generate using the tool (call tree summary, function summary, process summary, etc.) but I don't seem to find the information regarding CPU usage.
Which XML views should I export to get this information? Also, where can I find CPU usage? What are InclSamples and ExclSamples? Here is an example from my function summary XML:
<Function FunctionName="MatrixMultiply.Program.FunctionName" InclSamples="1,444" ExclSamples="881" InclSamplesPercent="97.30" ExclSamplesPercent="59.37" />
For generating the XML files correctly, you have to choose a target and checkbox the analyze CPU Usage in the Performance Profiler (Alt + F2) which is available in Debug > Performance Profiler from the menu dropdown. This window is the diagsession window which asks you to choose the performance reports you need and start the profiling.
- CPU Usage
- Memory Usage
- GPU Usage
Once you export the report into a VSPX file, VS2017 allows you to export this binary file into readable CSVs/XML files. In the menu that shows up, you could choose different reports you'd like. Some of the reports that the profiler can generate for you are:
Caller Callee Summary
Call Tree Summary
Function Summary
Header Summary
IP Summary
Line Summary
Marks Summary
Module Summary
Process Summary
Process Thread Summary
Thread Summary
It looks like you're most interested in the CallTreeSummary, the exported XML file will have the name <reportname>_CallTreeSummary.xml which contains the <PerformanceReport> consisting of <CallTreeSummary>
Each CallTree call in the CallTreeSummary contains the functionName, the InclSamples, ExclSamples and their respective percentages.
Here's an example:
For a sample C++ code as follows (Memory leak sample):
int main() {
while (true) {
int *p = new int;
}
return 0;
}
A part of the CallTree shows up as follows:
<CallTree Level="8" FunctionName="operator new" InclSamples="20,560" ExclSamples="78" InclSamplesPercent="97.46" ExclSamplesPercent="0.37" ModuleName="Sample.exe" />
<CallTree Level="9" FunctionName="[ucrtbased.dll]" InclSamples="20,482" ExclSamples="55" InclSamplesPercent="97.09" ExclSamplesPercent="0.26" ModuleName="ucrtbased.dll" />
<CallTree Level="10" FunctionName="[ucrtbased.dll]" InclSamples="20,427" ExclSamples="83" InclSamplesPercent="96.83" ExclSamplesPercent="0.39" ModuleName="ucrtbased.dll" />
<CallTree Level="11" FunctionName="[ucrtbased.dll]" InclSamples="20,344" ExclSamples="177" InclSamplesPercent="96.44" ExclSamplesPercent="0.84" ModuleName="ucrtbased.dll" />
<CallTree Level="12" FunctionName="[ucrtbased.dll]" InclSamples="20,119" ExclSamples="2,092" InclSamplesPercent="95.37" ExclSamplesPercent="9.92" ModuleName="ucrtbased.dll" />
InclSamples represents the total number of ticks taken to execute the function and any related functions that are called.
ExclSamples represents the total number of ticks taken to execute only the function.
For an illustrative example, consider the following example:
int bar() {
return 1;
}
int foo() {
return bar();
}
int main() {
int x = foo();
return 0;
}
A sample execution could show the following data:
<FunctionName="main" InclSamples="100" ExclSamples="10" InclSamplesPercent="100.00" .../>
<FunctionName="foo" InclSamples="90" ExclSamples="40" InclSamplesPercent="90.00".../>
<FunctionName="bar" InclSamples="50" ExclSamples="50" InclSamplesPercent="50.00" .../>
This is interpreted as follows:
Running the main() function takes 100 CPU ticks in total but only 10 of those ticks resulted in executing this function, leaving us with 90 ticks which were used in other functions being called from main()
Running the foo() function takes 90 CPU ticks (since it includes the run time of foo() + bar() but exclusively the foo() takes 40 ticks.
The bar() function takes 50 ticks to run.
Using the reasoning above, you could reason the CallTree sample provided above. The associated InclSamplesPercent is the percentage of time taken by the CPU to run the overall task. For e.g. from the sample above we can say that 100% of the CPU usage was by the main() function but 90% of that was taken up by foo() function and 50% by bar() function effectively making the ExclSamplesPercent by foo() as 90 - 50 = 40.00% and ExclSamplesPercent by main() as 100 - 90 = 10%
I cannot get what size parameters are for fftwf_plan_dft_r2c_2d
Input: a N rows by M cols matrix
Output: a N rows by floor(M/2) + 1 cols matrix ?
Are parameters input or output size?
Tried to give input size. This is what GDB sais
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1676.0x768]
0x637eed72 in n1fv_8 () from C:\devfiles\bin\libfftw3f-3.dll
(gdb) backtrace
#0 0x637eed72 in n1fv_8 () from C:\devfiles\bin\libfftw3f-3.dll
#1 0x7c91a000 in ntdll!RtlpUnWaitCriticalSection ()
from C:\WINDOWS\system32\ntdll.dll
No other thread uses fftw at the time.
The context:
Herbs::MatrixStorage<float> spectrum_in(frame_a.nRowsGet(),frame_a.nColsGet());
Herbs::MatrixStorage<std::complex<float>>
spectrum_out(frame_a.nRowsGet(),frame_a.nColsGet()/2+1);
Check for heap corruption issued by possible alllocation mistakes in MatrixStorage class . The rows in the matrix is one large block. rowGet returns the pointer to the given row.
memset(spectrum_in.rowGet(0),0
,sizeof(float)*spectrum_in.nRowsGet()*spectrum_in.nColsGet());
memset(spectrum_out.rowGet(0),0
,sizeof(float)*spectrum_out.nRowsGet()*spectrum_out.nColsGet());
heapdump(); //Heap seams to be fine after these
Causes sigsevg
FFT::PlanFloat_2dR2C plan(spectrum_in,spectrum_out);
The plan constructor does the following
plan=FFT::PlanFloat_2dR2C::PlanFloat_2dR2C(Herbs::MatrixStorage<InputType>& buffer_in
,Herbs::MatrixStorage<OutputType>& buffer_out)
{
plan=fftwf_plan_dft_r2c_2d
(
buffer_in.nRowsGet()
,buffer_in.nColsGet()
,buffer_in.rowGet(0)
,(fftwf_complex*)buffer_out.rowGet(0)
,FFTW_MEASURE
);
}
EDIT:
I used a precompiled DLL instead. GCC may produce bad code on 32-bit Windows (From release notes):
Removed an archaic stack-alignment hack that was failing with gcc-4.7/i386. Added stack-alignment hack necessary for gcc on Windows/i386. We will regret this in ten years (see previous change).
EDIT 2:
The dll became bad due to wrong options given to the configure script. The documentation is now updated.
You are correct. For an NxM float or double input you should allocate Nx(M/2+1) fftwf_complex or fftw_complex output.
The parameters are the input size. For example,
#include "fftw3.h"
#define Width 1024
#define Height 768
int main(){
float input[Width*Height];
fftwf_complex *fft = new fftwf_complex[((Width/2)+1)*Height];
fftwf_plan fplan = fftwf_plan_dft_r2c_2d(Height, Width,
(float*)input, fft, FFTW_ESTIMATE);
fftwf_execute(fplan);
fftwf_destroy_plan(fplan);
}
See Chapter 4.3 Basic Interface of the FFTW User Manual