How should I initialize an `Arc<[u8; 65536]>` efficiently? - performance

I'm writing an application creating Arc objects of large arrays:
use std::sync::Arc;
let buffer: Arc<[u8; 65536]> = Arc::new([0u8; 65536]);
After profiling this code, I've found that a memmove is occurring, making this slow. With other calls to Arc::new, the compiler seems smart enough to initialize the stored data without the memmove.
Believe it or not, the above code is faster than:
use std::sync::Arc;
use std::mem;
let buffer: Arc<[u8; 65536]> = Arc::new(unsafe { mem::uninitialized() });
Which is a bit of a surprise.
Insights welcome, I expect this is a compiler issue.

Yeah, right now you have to lean on the optimizer, and apparently it isn't eliding the copy in this case. I'm not sure why.
We are also still working on placement new functionality, which will be able to let you explicitly tell the compiler you want to initialize this on the heap directly. See https://github.com/rust-lang/rfcs/pull/809 (and https://github.com/rust-lang/rfcs/pull/1228 which proposes changes that are inconsequential for this question). Once this is implemented, this should work:
let buffer: Arc<_> = box [0u8; 65536];

Related

Proper way to manipulate registers (PUT32 vs GPIO->ODR)

I'm learning how to use microcontrollers without a bunch of abstractions. I've read somewhere that it's better to use PUT32() and GET32() instead of volatile pointers and stuff. Why is that?
With a basic pin wiggle "benchmark," the performance of GPIO->ODR=0xFFFFFFFF seems to be about four times faster than PUT32(GPIO_ODR, 0xFFFFFFFF), as shown by the scope:
(The one with lower frequency is PUT32)
This is my code using PUT32
PUT32(0x40021034, 0x00000002); // RCC IOPENR B
PUT32(0x50000400, 0x00555555); // PB MODER
while (1) {
    PUT32(0x50000414, 0x0000FFFF); // PB ODR
    PUT32(0x50000414, 0x00000000);
}
This is my code using the arrow thing
* (volatile uint32_t *) 0x40021034 = 0x00000002; // RCC IOPENR B
GPIOB->MODER = 0x00555555; // PB MODER
while (1) {
    GPIOB->ODR = 0x00000000; // PB ODR
    GPIOB->ODR = 0x0000FFFF;
}
I shamelessly adapted the assembly for PUT32 from somewhere
PUT32 PROC
EXPORT PUT32
STR R1,[R0]
BX LR
ENDP
My questions are:
Why is one method slower when it looks like they're doing the same thing?
What's the proper or best way to interact with GPIO? (Or rather what are the pros and cons of different methods?)
Additional information:
Chip is STM32G031G8Ux, using Keil uVision IDE.
I didn't configure the clock to go as fast as it can, but it should be consistent for the two tests.
Here's my hardware setup: (Scope probe connected to the LEDs. The extra wires should have no effect here)
Thank you for your time, sorry for any misunderstandings
PUT32 is a totally non-standard method that the poster in that other question made up. They have done this to avoid the complication and possible mistakes in defining the register access methods.
When you use the standard CMSIS header files and assign to the registers in the standard way, then all the complication has already been taken care of for you by someone who has specific knowledge of the target that you are using. They have designed it in a way that makes it hard for you to make the mistakes that the PUT32 is trying to avoid, and in a way that makes the final syntax look cleaner.
Writing to the registers directly is quicker because a register write can take as little as a single cycle of the processor clock, whereas calling a function, writing to the register, and then returning takes about four times longer in the context of your experiment.
By using this generic access method you also risk introducing bugs that are not possible with the manufacturer-provided header files: for example, using a 32-bit access when the register is 16 or 8 bits wide.
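To make the call-overhead point concrete, here is a minimal C sketch, assuming a host build in which a plain variable (the hypothetical fake_odr) stands in for the memory-mapped register. It shows that the PUT32 idea can be kept as a static inline function, which the compiler is free to expand into a single store at the call site, unlike a separately assembled routine that always costs a call and a return. The names put32_inline, get32_inline, wiggle and demo_put32 are illustrative and do not come from the original posts or from CMSIS.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a peripheral register so the sketch runs on a PC.
   On the real chip this would be a fixed address such as the PB ODR register. */
static volatile uint32_t fake_odr;

/* Same store as the assembly PUT32, but visible to the compiler,
   so it can be inlined instead of being called. */
static inline void put32_inline(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}

/* Matching read accessor. */
static inline uint32_t get32_inline(const volatile uint32_t *reg)
{
    return *reg;
}

/* Toggle the stand-in register n times and return its final value. */
uint32_t wiggle(unsigned n)
{
    for (unsigned i = 0; i < n; ++i) {
        put32_inline(&fake_odr, 0x0000FFFFu);
        put32_inline(&fake_odr, 0x00000000u);
    }
    return get32_inline(&fake_odr);
}

/* Usage example: the last store in each iteration clears the register. */
void demo_put32(void)
{
    assert(wiggle(3) == 0u);
}
```

Whether the inlined version matches the speed of GPIOB->ODR on the target depends on the compiler and optimization level, so it is worth checking on the scope again rather than assuming.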

Stimulate code-inlining

Unlike in languages like C++, where you can explicitly state inline, in Go the compiler automatically detects functions that are candidates for inlining (C++ compilers can do this as well; Go offers only the automatic route). There is also a debug option to see which inlining happens, yet very little is documented online about the exact logic the Go compiler(s) use for this.
Let's say I need to rerun some big loop over a set of data every n-period:
func Encrypt(password []byte) ([]byte, error) {
    return bcrypt.GenerateFromPassword(password, 13)
}
for id, data := range someDataSet {
    newPassword, _ := Encrypt([]byte("generatedSomething"))
    data["password"] = newPassword
    someSaveCall(id, data)
}
If I want Encrypt, for example, to be inlined properly, what logic do I need to take into consideration for the compiler?
I know from C++ that passing by reference will increase the likelihood of automatic inlining without the explicit inline keyword, but it's not easy to understand how exactly the Go compiler decides whether to inline. Scripting languages like PHP, for example, suffer immensely if you call addSomething($a, $b) in a loop: benchmarked over a billion iterations, its cost compared with an inline $a + $b is almost ridiculous.
Until you have performance problems, you shouldn't care. Inlined or not, it will do the same.
If performance does matter and it makes a noticeable and significant difference, then don't rely on current (or past) inlining conditions; "inline" it yourself (do not put it in a separate function).
The rules can be found in the $GOROOT/src/cmd/compile/internal/inline/inl.go file. You may control its aggressiveness with the 'l' debug flag.
// The inlining facility makes 2 passes: first caninl determines which
// functions are suitable for inlining, and for those that are it
// saves a copy of the body. Then InlineCalls walks each function body to
// expand calls to inlinable functions.
//
// The Debug.l flag controls the aggressiveness. Note that main() swaps level 0 and 1,
// making 1 the default and -l disable. Additional levels (beyond -l) may be buggy and
// are not supported.
// 0: disabled
// 1: 80-nodes leaf functions, oneliners, panic, lazy typechecking (default)
// 2: (unassigned)
// 3: (unassigned)
// 4: allow non-leaf functions
//
// At some point this may get another default and become switch-offable with -N.
//
// The -d typcheckinl flag enables early typechecking of all imported bodies,
// which is useful to flush out bugs.
//
// The Debug.m flag enables diagnostic output. a single -m is useful for verifying
// which calls get inlined or not, more is for debugging, and may go away at any point.
Also check out the blog post Dave Cheney - Five things that make Go fast (2014-06-07), which covers inlining (it's a long post; the relevant part is about halfway through, search for the word "inline").
There is also an interesting discussion about inlining improvements (maybe Go 1.9?): cmd/compile: improve inlining cost model #17566
Better still, don’t guess, measure!
You should trust the compiler and avoid trying to guess its inner workings as it will change from one version to the next.
There are far too many tricks the compiler, the CPU or the cache can play to be able to predict performance from source code.
What if inlining makes your code bigger to the point that it doesn’t fit in the cache anymore, making it much slower than the non-inlined version? Cache locality can have a much bigger impact on performance than branching.

Protecting memory from changing

Is there a way to protect an area of the memory?
I have this struct:
#define BUFFER 4
struct
{
    char s[BUFFER-1];
    const char zc;
} str = {'\0'};
printf("'%s', zc=%d\n", str.s, str.zc);
It is supposed to hold strings of length BUFFER-1 and guarantee that they end in '\0'.
But the compiler gives an error only for:
str.zc='e'; /*error */
Not if:
str.s[3]='e'; /*no error */
A solution that relies on compiling with gcc and some flag would be fine as well.
Thanks,
Beco
To detect errors at runtime, take a look at the -fstack-protector-all option in gcc. It may be of limited use when attempting to detect very small overflows like the one you described.
Unfortunately you aren't going to find a lot of info on detecting buffer overflow scenarios like the one you described at compile-time. From a C language perspective the syntax is totally correct, and the language gives you just enough rope to hang yourself with. If you really want to protect your buffers from yourself you can write a front-end to array accesses that validates the index before it allows access to the memory you want.
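The index-validating front-end mentioned above could look roughly like the following minimal C sketch. The struct mirrors the one in the question; the names guarded_str, guarded_set and demo_guarded are hypothetical and introduced here only for illustration. The check happens at runtime, so it catches str.s[3]-style writes only if every write goes through the accessor.

```c
#include <assert.h>
#include <stddef.h>

#define BUFFER 4

/* Same layout as the struct in the question, given a tag so it can be passed around. */
struct guarded_str {
    char s[BUFFER - 1];
    const char zc;      /* stays '\0' and terminates s */
};

/* Write c at index i only if i lies inside s.
   Returns 0 on success and -1 when the index is out of range. */
int guarded_set(struct guarded_str *g, size_t i, char c)
{
    if (i >= sizeof g->s) {
        return -1;
    }
    g->s[i] = c;
    return 0;
}

/* Usage example: indexes 0..2 are accepted, index 3 (the terminator) is refused. */
void demo_guarded(void)
{
    struct guarded_str g = { {0}, '\0' };
    assert(guarded_set(&g, 0, 'a') == 0);
    assert(guarded_set(&g, 2, 'c') == 0);
    assert(guarded_set(&g, 3, 'e') == -1);
    assert(g.zc == '\0');
}
```

Returning an error code rather than aborting is a design choice made for this sketch; an assert or a logged failure inside guarded_set would work just as well if an out-of-range index should be treated as a programming error.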

D Dynamic Arrays - RAII

I admit I have no deep understanding of D at this point, my knowledge relies purely on what documentation I have read and the few examples I have tried.
In C++ you could rely on the RAII idiom to call the destructor of objects on exiting their local scope.
Can you in D?
I understand D is a garbage collected language, and that it also supports RAII.
Why does the following code not clean up the memory as it leaves a scope, then?
import std.stdio;
void main() {
    {
        const int len = 1000 * 1000 * 256; // ~1GiB
        int[] arr;
        arr.length = len;
        arr[] = 99;
    }
    while (true) {}
}
The infinite loop is there to keep the program open so that residual memory allocations are easily visible.
A comparison with an equivalent program in C++ is shown below.
It can be seen that C++ immediately cleaned up the memory after allocation (the refresh rate makes it appear as if less memory was allocated), whereas D kept it even though it had left scope.
Therefore, when does the GC clean up?
scope declarations are going away in D2, so I'm not terribly certain of the semantics, but I'd imagine that scope T[] a; only allocates the array struct on the stack (which, needless to say, already happens regardless of scope). Since they are going away, don't use scope (scope(exit) and friends are different; keep using them).
Dynamic arrays always use the GC to allocate their memory -- there's no getting around that. If you want something more deterministic, using std.container.Array would be the simplest manner, as I think you could pretty much drop it in where your scope vector3b array is:
Array!vector3b array;
Just don't bother setting the length to zero: the memory will be freed once it goes out of scope (Array uses malloc/free from libc under the hood).
No, you cannot assume that the garbage collector will collect your object at any point in time.
There is, however, a delete keyword (as well as a scope keyword) that can delete an object deterministically.
scope is used like:
{
scope auto obj = new int[5];
//....
} //obj cleaned up here
and delete is used like in C++ (there's no [] notation for delete).
There are some gotchas, though:
It doesn't always work properly (I hear it doesn't work well with arrays)
The developers of D (e.g. Andrei) are intending to remove them in later versions, because it can obviously mess up things if used incorrectly. (I personally hate this, given that it's so easy to screw things up anyway, but they're sticking with removing it, and I don't think people can convince them otherwise although I'd love it if that was the case.)
In its place, there is already a clear method that you can use, like arr.clear(); however, I'm not quite sure what it exactly does yet myself, but you could look at the source code in object.d in the D runtime if you're interested.
As to your amazement: I'm glad you're amazed, but it shouldn't be really surprising considering that they're both native code. :-)

Some Windows API calls fail unless the string arguments are in the system memory rather than local stack

We have an older massive C++ application and we have been converting it to support Unicode as well as 64-bits. The following strange thing has been happening:
Calls to registry functions and windows creation functions, like the following, have been failing:
hWnd = CreateSysWindowExW( ExStyle, ClassNameW.StringW(), Label2.StringW(), Style,
Posn.X(), Posn.Y(),
Size.X(), Size.Y(),
hParentWnd, (HMENU)Id,
AppInstance(), NULL);
ClassNameW and Label2 are instances of our own Text class which essentially uses malloc to allocate the memory used to store the string.
Anyway, when the functions fail, and I call GetLastError it returns the error code for "invalid memory access" (though I can inspect and see the string arguments fine in the debugger). Yet if I change the code as follows then it works perfectly fine:
BSTR Label2S = SysAllocString(Label2.StringW());
BSTR ClassNameWS = SysAllocString(ClassNameW.StringW());
hWnd = CreateSysWindowExW( ExStyle, ClassNameWS, Label2S, Style,
Posn.X(), Posn.Y(),
Size.X(), Size.Y(),
hParentWnd, (HMENU)Id,
AppInstance(), NULL);
SysFreeString(ClassNameWS); ClassNameWS = 0;
SysFreeString(Label2S); Label2S = 0;
So what gives? Why would the original functions work fine with the arguments in local memory, while with Unicode the registry functions require SysAllocString, and in 64-bit the window creation functions also require SysAllocString'd string arguments? Our window procedure functions have all been converted to Unicode, and yes, we use SetWindowLongW and call the correct Unicode default, DefWindowProcW, etc. That all seems to work fine and handles and draws Unicode properly.
The documentation at http://msdn.microsoft.com/en-us/library/ms632679%28v=vs.85%29.aspx does not say anything about this. While our application is massive we do use debug heaps and tools like Purify to check for and clean up any memory corruption. Also at the time of this failure, there is still only one main system thread. So it is not a thread issue.
So what is going on? I have read that if string arguments are marshalled anywhere or passed across process boundaries, then you have to use SysAllocString/BSTR, yet we call lots of API functions and there is lots of code out there which calls these functions just using plain local strings?
What am I missing? I have tried Googling this, as someone else must have run into this, but with little luck.
Edit 1: Our StringW function does not create any temporary objects which might go out of scope before the actual API call. The function is as follows:
class Text {
public:
    const wchar_t* StringW () const
    {
        return TextStartW;
    }
private:
    wchar_t* TextStartW; // pointer to current start of text in DataArea
};
I have been running our application with the debug heap and memory checking and other diagnostic tools, and found no source of memory corruption, and looking at the assembly, there is no sign of temporary objects or invalid memory access.
BUT I finally figured it out:
We compile our code with /Zp1, which means byte-aligned memory allocations. SysAllocString (in 64-bit) always returns a pointer that is aligned on an 8-byte boundary. Presumably a 32-bit ANSI C++ application goes through an API layer to the underlying Unicode Windows DLLs, which would also align the pointer for you.
But if you use Unicode, you do not get that incidental pointer alignment that the conversion mapping layer gives you, and if you use 64-bits, of course the situation will get even worse.
I added a method to our Text class which shifts the string pointer so that it is aligned on an eight-byte boundary, and voilà, everything runs fine!!!
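For readers who want to test the alignment theory on their own strings, here is a minimal C sketch under stated assumptions. It is not the method added to the Text class (that code is not shown); instead of shifting the pointer, it swaps in a simpler technique: test the pointer's alignment and, if needed, copy the string into fresh malloc storage. The names is_aligned, aligned_copy and demo_alignment are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Nonzero when p lies on an `alignment`-byte boundary.
   `alignment` must be a power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (uintptr_t)(alignment - 1u)) == 0u;
}

/* Copy a wide string, terminator included, into freshly malloc'd storage.
   The caller frees the result. Returns NULL if the allocation fails. */
static wchar_t *aligned_copy(const wchar_t *src)
{
    size_t bytes = (wcslen(src) + 1u) * sizeof(wchar_t);
    wchar_t *dst = malloc(bytes);
    if (dst != NULL) {
        memcpy(dst, src, bytes);
    }
    return dst;
}

/* Usage example: an 8-byte-aligned buffer passes the check,
   the address one byte further in does not. */
void demo_alignment(void)
{
    _Alignas(8) unsigned char raw[16] = {0};
    assert(is_aligned(raw, 8));
    assert(!is_aligned(raw + 1, 8));

    wchar_t *copy = aligned_copy(L"Label");
    assert(copy != NULL);
    assert(is_aligned(copy, _Alignof(wchar_t)));
    free(copy);
}
```

A check like is_aligned placed just before the failing API call would show whether the pointer returned by StringW() really sits on an odd boundary, which is the claim this answer rests on.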
Of course the Microsoft people say it must be memory corruption and that I am jumping to the wrong conclusion, but there is evidence that this is not the case.
Also, if you use /Zp1 and include windows.h in a 64-bit application, the debugger will tell you sizeof(BITMAP)==28, but calling GetObject on a bitmap will fail and tell you it needs a 32-byte structure. So I suspect that some of Microsoft's API is inherently dependent on aligned pointers, and I also know that some optimized assembly (I have seen some from Fortran compilers) takes advantage of that and crashes badly if you ever give it unaligned pointers.
So the moral of all of this is: don't use "funky" compiler arguments like /Zp1. In our case we have to for historical reasons, but the number of times this has bitten us...
Could someone please give my answer a "this is useful" tick?
Using a bit of psychic debugging, I'm going to guess that the strings in your application are pooled in a read-only section.
It's possible that CreateSysWindowExW is attempting to write to the memory passed in for the window class or title. That would explain why the calls work when the strings are allocated on the heap (SysAllocString) but not when they are used as constants.
The easiest way to investigate this is to use a low level debugger like windbg - it should break into the debugger at the point where the access violation occurs which should help figure out the problem. Don't use Visual Studio, it has a nasty habit of being helpful and hiding first chance exceptions.
Another thing to try is to enable appverifier on your application - it's possible that it may show something.
Calling a Windows API function does not cross the process boundary, since the various Windows DLLs are loaded into your process.
It sounds like whatever pointer StringW() is returning isn't valid when Windows is trying to access it. I would look there: is it possible that the returned pointer is out of scope and deleted shortly after it is called?
If you share some more details about your string class, that could help diagnose the problem here.
