I'm writing an emulator in Go, and for debugging purposes I'm logging the CPU state on every emulator cycle so I can generate a log file later.
There's something I'm not doing properly, because while the logger is enabled performance drops so much that the emulator becomes unusable.
The profiler clearly shows that the culprit is the logging routine (the logStep method):
The logStep method is very simple: it calls CreateState to snapshot the current CPU state into a struct, and then appends it to a slice (in the Log method).
I call this method on every emulated CPU cycle (around 30,000 times per second), and I suspect either the garbage collector is slowing execution down or I'm doing something wrong with this data structure.
The profile graph points me to runtime.growslice, caused by the append in (*cpu6502Logger).Log, but I'm unable to find information on how to do this more efficiently.
I'm also scratching my head over why CreateState takes so long just to create a simple struct.
This is what CpuState looks like:
type CpuState struct {
    Registers          Cpu6502Registers
    CurrentInstruction Instruction
    RawOpcode          [4]byte
    EvaluatedAddress   Address
    CyclesSinceReset   uint32
}
This is how I create a CPU Snapshot:
func CreateState(cpu Cpu6502) CpuState {
    pc := cpu.Registers().Pc
    var rawOpcode [4]byte
    rawOpcode[0] = 0x00
    pc++
    instruction := cpu.instructions[rawOpcode[0]]
    for i := byte(0); i < (instruction.Size() - 1); i++ {
        rawOpcode[1+i] = cpu.memory.Read(pc + Address(i))
    }
    _, evaluatedAddress, _, _ := cpu.addressEvaluators[instruction.AddressMode()](pc)
    state := CpuState{
        *cpu.Registers(),
        instruction,
        rawOpcode,
        evaluatedAddress,
        cpu.cycle,
    }
    return state
}
And finally, this is how I add a snapshot to the collection (the Log method in the profile graph). I've also added how I initialize logger.snapshots:
func createCPULogger(outputPath string) cpu6502Logger {
    return cpu6502Logger{
        outputPath: outputPath,
        snapshots:  make([]CpuState, 0, 10024),
    }
}

func (logger *cpu6502Logger) Log(state CpuState) {
    logger.snapshots = append(logger.snapshots, state)
}
Why is it slow?
Maintaining one gigantic slice that holds all the data is very costly, mainly because it keeps growing. Each time an append exceeds the slice's capacity, the whole backing array is copied into a bigger one so it can expand. As the slice grows, each reallocation has more data to copy and gets slower and slower. And you told us you emulate thousands of CPU states per second.
Solution
The best way to deal with this is to allocate a fixed-size buffer up front. We know that eventually it will fill up, and when that happens we have two options. The first is to write all the data from the buffer to the file, reset the buffer, and start filling it again (then write again, and so on). The other option is to keep the filled buffers in a slice and allocate a new one. Choose whichever fits your machine (a box with slow or limited RAM is not a good fit for the second option).
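For instance, here is a minimal sketch of the first option, assuming the logger keeps an open output file; the outputFile field, the flush method, and the text format are illustrative and not taken from your code:
func (logger *cpu6502Logger) Log(state CpuState) {
    if len(logger.snapshots) == cap(logger.snapshots) {
        logger.flush() // buffer is full: write it to disk, then reuse it
    }
    logger.snapshots = append(logger.snapshots, state)
}

func (logger *cpu6502Logger) flush() {
    for _, s := range logger.snapshots {
        // outputFile is an assumed *os.File (or bufio.Writer) opened from outputPath.
        fmt.Fprintf(logger.outputFile, "%+v\n", s)
    }
    // Keep the backing array, only reset the length, so appends never reallocate.
    logger.snapshots = logger.snapshots[:0]
}
With the snapshots slice pre-allocated as in createCPULogger, Log never triggers growslice; the only cost left is the periodic flush.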
Why does this help?
I think this also helps the emulator itself. There will be a performance spike each time the buffer is flushed, but most of the time performance will be at its maximum. Allocating a big chunk of memory is simply slow, since the allocator is less likely to find a fitting region on the first try. The garbage collector is also very unhappy about frequent allocations. By allocating one buffer and filling it, we do a single big (but not too big) allocation and store the data in sections. Sections that have already been written out can stay where they are. You could also say that in this case we are managing memory ourselves more than the GC does (no garbage is produced).
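And a corresponding sketch of the second option, keeping filled chunks in memory instead of writing them out immediately (chunkSize and the type name are illustrative, not from the original code):
const chunkSize = 10024 // tune to whatever fits your RAM

type chunkedLogger struct {
    full    [][]CpuState // filled chunks; they never get copied again
    current []CpuState
}

func newChunkedLogger() *chunkedLogger {
    return &chunkedLogger{current: make([]CpuState, 0, chunkSize)}
}

func (l *chunkedLogger) Log(state CpuState) {
    if len(l.current) == cap(l.current) {
        l.full = append(l.full, l.current)         // park the filled chunk
        l.current = make([]CpuState, 0, chunkSize) // start a fresh one
    }
    l.current = append(l.current, state)
}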
Related
I am implementing a cache in Golang. Let's say the cache could be implemented as sync.Map with integer key and value as a struct:
type value struct {
    fileName     string
    functionName string
}
A huge number of records have the same fileName and functionName. To save memory I want to use a string pool. Go has immutable strings, and my idea looks like this:
var (
    cache      sync.Map
    stringPool sync.Map
)

type value struct {
    fileName     string
    functionName string
}

func addRecord(key int64, val value) {
    fileName, _ := stringPool.LoadOrStore(val.fileName, val.fileName)
    val.fileName = fileName.(string)
    functionName, _ := stringPool.LoadOrStore(val.functionName, val.functionName)
    val.functionName = functionName.(string)
    cache.Store(key, val)
}
My idea is to keep every unique string (fileName and functionName) in memory once. Will it work?
The cache implementation must be concurrency-safe. The number of records in the cache is about 10^8; the number of records in the string pool is about 10^6.
I have some logic that removes records from the cache, so the size of the main cache itself is not a problem.
Could you please suggest how to manage the string pool's size?
I am thinking about storing a reference count for every record in the string pool, but that would require additional synchronization, or probably global locks, to maintain. I would like the implementation to be as simple as possible; you can see in my code snippet that I don't use any additional mutexes.
Or maybe I need to follow a completely different approach to minimize the memory usage of my cache?
What you are trying to do with stringPool is commonly known as string interning. There are libraries like github.com/josharian/intern that provide "good enough" solutions to that kind of problem, and that do not require you to manually maintain the stringPool map. Note that no solution (including yours, assuming you eventually remove some elements from stringPool) can reliably deduplicate 100% of strings without incurring impractical levels of CPU overhead.
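For illustration, here is roughly what addRecord could look like on top of that library instead of a hand-rolled stringPool (this assumes github.com/josharian/intern, whose String function returns a canonical copy of its argument and is, as far as I know, safe for concurrent use):
import "github.com/josharian/intern"

func addRecord(key int64, val value) {
    // Equal strings end up sharing one backing allocation.
    val.fileName = intern.String(val.fileName)
    val.functionName = intern.String(val.functionName)
    cache.Store(key, val)
}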
As a side note, it's worth pointing out that sync.Map is not really designed for update-heavy workloads. Depending on the keys used, you may actually experience significant contention when calling cache.Store. Furthermore, since sync.Map relies on interface{} for both keys and values, it normally incurs many more allocations than a plain map. Make sure to benchmark with realistic workloads to ensure that you picked the right approach.
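If benchmarking does show sync.Map to be a problem, the plain-map alternative alluded to above could look roughly like this (a sketch only; mutexCache is a made-up name, and sharding the map would be the next step if the single lock contends):
type mutexCache struct {
    mu sync.RWMutex
    m  map[int64]value
}

func (c *mutexCache) store(key int64, val value) {
    c.mu.Lock()
    c.m[key] = val // no interface{} boxing of keys or values, unlike sync.Map
    c.mu.Unlock()
}

func (c *mutexCache) load(key int64) (value, bool) {
    c.mu.RLock()
    v, ok := c.m[key]
    c.mu.RUnlock()
    return v, ok
}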
I saw an issue on GitHub which says sync.Pool should be used only with pointer types, for example:
var TPool = sync.Pool{
    New: func() interface{} {
        return new(T)
    },
}
Does that make sense? What about returning T{} instead? Which is the better choice, and why?
The whole point of sync.Pool is to avoid (expensive) allocations. Large-ish buffers, etc. You allocate a few buffers and they stay in memory, available for reuse. Hence the use of pointers.
But here you'll be copying the values on every step, defeating the purpose. (This assumes your T is a "normal" struct and not something like SliceHeader.)
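A minimal sketch of the usual pointer-based pattern, using bytes.Buffer merely as a stand-in for whatever large-ish object you actually pool:
var bufPool = sync.Pool{
    New: func() interface{} {
        return new(bytes.Buffer) // a pointer: storing it in the interface copies a word, not the buffer
    },
}

func withBuffer() {
    buf := bufPool.Get().(*bytes.Buffer)
    defer func() {
        buf.Reset() // wipe contents before handing the buffer back
        bufPool.Put(buf)
    }()
    buf.WriteString("do some work with the pooled buffer")
}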
It is not strictly necessary. In most cases it should be a pointer, since you want to share an object, not make copies of it.
In some use cases it can be a non-pointer type, like an ID of some external resource. I can imagine a pool of paths (mounted disk drives), represented as strings, on which some large file operations are being conducted.
I've been working on an embedded OS for ARM. However, there are a few things I didn't understand about the architecture, even after referring to the ARM ARM and the Linux source.
Atomic operations.
The ARM ARM says that Load and Store instructions are atomic and their execution is guaranteed to complete before an interrupt handler executes. I verified this by looking at
arch/arm/include/asm/atomic.h:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
However, the problem comes in when I want to manipulate this value atomically using the CPU instructions (atomic_inc, atomic_dec, atomic_cmpxchg etc.), which use LDREX and STREX on ARMv7 (my target).
The ARM ARM doesn't say anything about interrupts being blocked in this section, so I assume an interrupt can occur between the LDREX and STREX. The thing it does mention is locking the memory bus, which I guess is only helpful for MP systems, where multiple CPUs can try to access the same location at the same time. But for UP (and possibly MP): if a timer interrupt (or an IPI for SMP) fires in this small window between LDREX and STREX, the exception handler executes, possibly changes the CPU context, and returns to a new task. The shocking part comes now: it executes CLREX, thereby removing any exclusive lock held by the previous thread. So how is using LDREX and STREX any better than LDR and STR for atomicity on a UP system?
I did read something about an exclusive monitor, so I have a possible theory: when the thread resumes and executes the STREX, the monitor causes that store to fail, the failure can be detected, and the loop can be re-executed using the new value (branch back to LDREX). Am I right here?
The idea behind the load-linked/store-exclusive paradigm is that if the store follows very soon after the load, with no intervening memory operations, and if nothing else has touched the location, the store is likely to succeed, but if something else has touched the location the store is certain to fail. There is no guarantee that stores will not sometimes fail for no apparent reason; if the time between load and store is kept to a minimum, however, and there are no memory accesses between them, a loop like:
do
{
    new_value = __LDREXW(dest) + 1;
} while (__STREXW(new_value, dest));
can generally be relied upon to succeed within a few attempts. If computing the new value based on the old value required some significant computation, one should rewrite the loop as:
do
{
    old_value = *dest;
    new_value = complicated_function(old_value);
} while (CompareAndStore(dest, new_value, old_value) != 0);
... Assuming CompareAndStore is something like:
uint32_t CompareAndStore(uint32_t *dest, uint32_t new_value, uint32_t old_value)
{
    do
    {
        if (__LDREXW(dest) != old_value) return 1; // Failure
    } while (__STREXW(new_value, dest));
    return 0;
}
This code will have to rerun its main loop if something changes *dest while the new value is being computed, but only the small loop will need to be rerun if the __STREXW fails for some other reason (which is hopefully not too likely, given that there will only be about two instructions between the __LDREXW and the __STREXW).
Addendum
An example of a situation where "compute new value based on old" could be complicated would be one where the "values" are effectively references to a complex data structure. Code may fetch the old reference, derive a new data structure from the old one, and then update the reference. This pattern comes up much more often in garbage-collected frameworks than in "bare metal" programming, but there are a variety of ways it can come up even when programming bare metal. Normal malloc/calloc allocators are not generally thread-safe or interrupt-safe, but allocators for fixed-size structures often are. If one has a "pool" of some power-of-two number of data structures (say 256), one could use something like:
#define FOO_POOL_SIZE_SHIFT 8
#define FOO_POOL_SIZE (1 << FOO_POOL_SIZE_SHIFT)
#define FOO_POOL_SIZE_MASK (FOO_POOL_SIZE-1)

void do_update(void)
{
    // The foo_pool_alloc() method should return a slot number in the lower bits and
    // some sort of counter value in the upper bits so that once some particular
    // uint32_t value is returned, that same value will not be returned again unless
    // there are at least (UINT_MAX)/(FOO_POOL_SIZE) intervening allocations (to avoid
    // the possibility that while one task is performing its update, a second task
    // changes the thing to a new one and releases the old one, and a third task gets
    // given the newly-freed item and changes the thing to that, such that from the
    // point of view of the first task, the thing never changed.)
    uint32_t new_thing = foo_pool_alloc();
    uint32_t old_thing;
    do
    {
        // Capture old reference
        old_thing = foo_current_thing;
        // Compute new thing based on old one
        update_thing(&foo_pool[new_thing & FOO_POOL_SIZE_MASK],
                     &foo_pool[old_thing & FOO_POOL_SIZE_MASK]);
    } while(CompareAndSwap(&foo_current_thing, new_thing, old_thing) != 0);
    foo_pool_free(old_thing);
}
If there will not often be multiple threads/interrupts/whatever trying to update the same thing at the same time, this approach should allow updates to be performed safely. If a priority relationship will exist among the things that may try to update the same item, the highest-priority one is guaranteed to succeed on its first attempt, the next-highest-priority one will succeed on any attempt that isn't preempted by the highest-priority one, etc. If one was using locking, the highest-priority task that wanted to perform the update would have to wait for the lower-priority update to finish; using the CompareAndSwap paradigm, the highest-priority task will be unaffected by the lower one (but will cause the lower one to have to do wasted work).
Okay, got the answer from their website.
If a context switch schedules out a process after the process has performed a Load-Exclusive but before it performs the Store-Exclusive, the Store-Exclusive returns a false negative result when the process resumes, and memory is not updated. This does not affect program functionality, because the process can retry the operation immediately.
How do I allocate memory using new at a fixed location? My book says to do this:
char *buf=new char[sizeof(sample)];
sample *p=new(buf)sample(10,20);
Here new is allocating memory at buf's address, and (10,20) are the values being passed. But what is sample? Is it an address or a data type?
Let me explain this code to you...
char *buf=new char[sizeof(sample)];
sample *p=new(buf)sample(10,20);
This is really four lines of code, written as two for your convenience. Let me just expand them:
char *buf; // 1
buf = new char[sizeof(sample)]; // 2
sample *p; // 3
p = new(buf)sample(10,20); // 4
Lines 1 and 3 are simple to explain: they both declare pointers. buf is a pointer to a char, and p is a pointer to a sample. Now, we cannot see what sample is, but we can assume that it is either a class defined elsewhere or some data type that has been typedef'd (more or less just given a new name). Either way, sample can be thought of as a data type, just like int or string.
Line 2 allocates a block of memory and assigns it to our char pointer called buf. Let's say sample is a class that contains two ints; that means it is (under most compilers) going to be 8 bytes (4 per int). So buf points to the start of a block of memory that has been set aside to hold chars.
Line 4 is where it gets a bit complex. If it were just p = new sample(10,20), it would be a simple case of creating a new object of type sample, passing it the two ints, and storing the address of this new object in the pointer p. The addition of (buf) is basically telling new to make use of the memory pointed to by buf (this is known as placement new).
The end effect is that you have one block of memory allocated (more or less 8 bytes) with two pointers pointing to it. One of the pointers, buf, is looking at that memory as 8 chars; the other, p, is looking at it as a single sample.
Why would you do this?
Normally, you wouldn't. Modern C++ has made the use of raw new rather redundant; there are many better ways to deal with objects. I guess the main reason for using this method is that, for some reason, you want to keep a pool of memory pre-allocated, since it can take time to obtain large blocks of memory and you might be able to save yourself some time.
For the most part, if you think you need to do something like this, you are probably trying to solve the wrong problem.
A Bit Extra
I do not have much experience with embedded or mobile devices, but I have never seen this used.
The code you posted is basically the same as just doing sample *p = new sample(10,20); neither method controls where the sample object is created.
Also consider that you do not always need to create objects dynamically using new.
void myFunction(){
    sample p = sample(10,20);
}
This automatically creates a sample object for you. This method is much preferable, as it is easier to read and understand, and you do not need to worry about deleting the object; it will be cleaned up for you when the function returns.
If you really do need to make use of dynamic objects, consider using smart pointers, something like unique_ptr<sample>. This will give you the ability to create objects dynamically but save you the hassle of manually deleting the object of type sample. (I can point you towards more info on this if you like.)
It is a datatype or a typedef.
Consider a complex, memory-hungry, multi-threaded application running within a 32-bit address space on Windows XP.
Certain operations require n large buffers of fixed size, where only one buffer needs to be accessed at a time.
The application uses a pattern where some address space the size of one buffer is reserved early and is used to contain the currently needed buffer.
This follows the sequence:
(initial run) VirtualAlloc -> VirtualFree -> MapViewOfFileEx
(buffer changes) UnMapViewOfFile -> MapViewOfFileEx
Here the pointer to the buffer location is provided by the call to VirtualAlloc and then that same location is used on each call to MapViewOfFileEx.
The problem is that Windows does not (as far as I know) provide any handshake-type operation for passing the memory space between the different users.
Therefore there is a small opportunity (at each -> in my above sequence) where the memory is not locked and another thread can jump in and perform an allocation within the buffer.
The next call to MapViewOfFileEx is broken and the system can no longer guarantee that there will be a big enough space in the address space for a buffer.
Obviously refactoring to use smaller buffers reduces the rate of failures to reallocate space.
Some use of HeapLock has had some success but this still has issues - something still manages to steal some memory from within the address space.
(We tried Calling GetProcessHeaps then using HeapLock to lock all of the heaps)
What I'd like to know is: is there any way to lock a specific block of address space in a way that is compatible with MapViewOfFileEx?
Edit: I should add that ultimately this code lives in a library that gets called by an application outside of my control
You could brute force it; suspend every thread in the process that isn't the one performing the mapping, Unmap/Remap, unsuspend the suspended threads. It ain't elegant, but it's the only way I can think of off-hand to provide the kind of mutual exclusion you need.
Have you looked at creating your own private heap via HeapCreate? You could set the heap to your desired buffer size. The only remaining problem is then how to get MapViewOfFile to use your private heap instead of the default heap.
I'd assume that MapViewOfFile internally calls GetProcessHeap to get the default heap and then requests a contiguous block of memory from it. You can surround the call to MapViewOfFile with a detour, i.e., you rewire the GetProcessHeap call by overwriting the function in memory, effectively inserting a jump to your own code, which can return your private heap.
Microsoft has published the Detours library, though I'm not directly familiar with it. I do know that detouring is surprisingly common; security software, virus scanners, etc. all use such frameworks. It's not pretty, but it may work:
HANDLE g_hndPrivateHeap;

HANDLE WINAPI GetProcessHeapImpl() {
    return g_hndPrivateHeap;
}

struct SDetourGetProcessHeap { // object for exception safety
    SDetourGetProcessHeap() {
        // put detour in place
    }
    ~SDetourGetProcessHeap() {
        // remove detour again
    }
};

void MapFile() {
    g_hndPrivateHeap = HeapCreate( ... );
    {
        SDetourGetProcessHeap d;
        MapViewOfFile(...);
    }
}
These may also help:
How to replace WinAPI functions calls in the MS VC++ project with my own implementation (name and parameters set are the same)?
How can I hook Windows functions in C/C++?
http://research.microsoft.com/pubs/68568/huntusenixnt99.pdf
Imagine if I came to you with a piece of code like this:
void *foo;
foo = malloc(n);
if (foo)
    free(foo);
foo = malloc(n);
Then I came to you and said: help! foo does not have the same address on the second allocation!
I'd be crazy, right?
It seems to me like you've already demonstrated clear knowledge of why this doesn't work. There's a reason that the documentation for any API that takes an explicit address to map into lets you know that the address is just a suggestion and can't be guaranteed. This also goes for mmap() on POSIX.
I would suggest you write the program in such a way that a change in address doesn't matter. That is, don't store too many pointers to quantities inside the buffer, or if you do, patch them up after reallocation. Similar to the way you'd treat a buffer that you were going to pass into realloc().
Even the documentation for MapViewOfFileEx() explicitly suggests this:
While it is possible to specify an address that is safe now (not used by the operating system), there is no guarantee that the address will remain safe over time. Therefore, it is better to let the operating system choose the address. In this case, you would not store pointers in the memory mapped file, you would store offsets from the base of the file mapping so that the mapping can be used at any address.
Update from your comments
In that case, I suppose you could:
Not map into contiguous blocks. Perhaps you could map in chunks and write some intermediate function to decide which to read from/write to?
Try porting to 64 bit.
As the earlier post suggests, you can suspend every thread in the process while you change the memory mappings. You can use SuspendThread()/ResumeThread() for that. This has the disadvantage that your code has to know about all the other threads and hold thread handles for them.
An alternative is to use the Windows debug API to suspend all threads. If a process has a debugger attached, then every time the process faults, Windows will suspend all of the process's threads until the debugger handles the fault and resumes the process.
Also see this question which is very similar, but phrased differently:
Replacing memory mappings atomically on Windows