boost beast async_write increases memory footprint dramatically - c++11

I am currently experimenting with the Boost.Beast library and am quite surprised by its memory footprint. I've found that by using three different response body types (string, file, dynamic) the program size grows to roughly 6 MB.
To get closer to the cause, I took the small server example from the library and reduced it to the following:
class http_connection : public std::enable_shared_from_this<http_connection>
{
public:
    http_connection(tcp::socket socket) : socket_(std::move(socket)) { }

    void start() {
        read_request();
    }

private:
    tcp::socket socket_;
    beast::flat_buffer buffer_{8192};
    http::request<http::dynamic_body> request_;

    void read_request() {
        auto self = shared_from_this();
        http::async_read(
            socket_, buffer_, request_,
            [self](beast::error_code ec,
                   std::size_t bytes_transferred)
            {
                self->write_response(std::make_shared<http::response<http::dynamic_body>>());
                self->write_response(std::make_shared<http::response<http::file_body>>());
                self->write_response(std::make_shared<http::response<http::string_body>>(), true);
            });
    }

    template <class T>
    void write_response(std::shared_ptr<T> response, bool dostop = false) {
        auto self = shared_from_this();
        http::async_write(
            socket_,
            *response,
            [self, response, dostop](beast::error_code ec, std::size_t)
            {
                if (dostop)
                    self->socket_.shutdown(tcp::socket::shutdown_send, ec);
            });
    }
};
When I comment out the three self->write_response calls, compile the program, and run the size command on the result, I get:
text data bss dec hex filename
343474 1680 7408 352562 56132 small
When I uncomment the first write, I get:
text data bss dec hex filename
864740 1714 7408 873862 d5586 small
After uncommenting all three calls, the final size becomes:
text data bss dec hex filename
1333510 1730 7408 1342648 147cb8 small
4,8M Feb 16 22:13 small*
The question now is:
Am I doing something wrong?
Is there a way to reduce the size?
UPDATE
The real process_request looks like this:
void process_request() {
    auto it = router.find(request.method(), request.target());
    if (it != router.end()) {
        auto response = it->getHandler()(doc_root_, request);
        if (boost::apply_visitor(dsa::type::handler(), response) == TypeCode::dynamic_r) {
            auto r = boost::get<std::shared_ptr<dynamic_response>>(response);
            send(r);
            return;
        }
        if (boost::apply_visitor(dsa::type::handler(), response) == TypeCode::file_r) {
            auto r = boost::get<std::shared_ptr<file_response>>(response);
            send(r);
            return;
        }
        if (boost::apply_visitor(dsa::type::handler(), response) == TypeCode::string_r) {
            auto r = boost::get<std::shared_ptr<string_response>>(response);
            send(r);
            return;
        }
    }
    send(boost::get<std::shared_ptr<string_response>>(send_bad_response(
        http::status::bad_request,
        "Invalid request-method '" + std::string(request.method_string()) + "'\r\n")));
}
Thanks in advance

If you aren't actually leaking memory, then there is nothing wrong. Whatever memory is allocated by the system will either be reused for your program or eventually given back. It can be very difficult to measure the true memory usage of a program, especially under Linux, because of the virtual memory system. Unless you see an actual leak or real problem, I would ignore those memory reports and simply continue implementing your business logic. Beast itself contains no memory leaks (tested extensively per-commit on Travis and Appveyor under valgrind, asan, and ubsan).

Try using malloc_trim(0), e.g. in the destructor of http_connection.
From the man page:
malloc_trim - release free memory from the top of the heap.
The malloc_trim() function attempts to release free memory at the top of the heap (by calling sbrk(2) with a suitable argument).
The pad argument specifies the amount of free space to leave untrimmed at the top of the heap.
If this argument is 0, only the minimum amount of memory is maintained at the top of the heap (i.e., one page or less). A nonzero argument can be used to maintain some trailing space at the top of the heap in order to allow future allocations to be made without having to extend the heap with sbrk(2).
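For illustration, here is a minimal, self-contained sketch of what the suggestion amounts to, assuming glibc on Linux (malloc_trim is a GNU extension declared in <malloc.h>); in the question's setup the call could go, for example, into the http_connection destructor:

#include <cstdlib>
#include <malloc.h>   // malloc_trim (GNU extension)

int main()
{
    // Grow the main heap with many small blocks, then free them.
    // glibc normally keeps this memory cached for reuse rather than returning it to the OS.
    void* blocks[10000];
    for (void*& b : blocks) b = std::malloc(4096);
    for (void*  b : blocks) std::free(b);

    // Ask the allocator to release free memory at the top of the heap back to the OS.
    malloc_trim(0);
}

Whether this visibly shrinks the resident set depends on how fragmented the heap is, so it is worth checking the effect in /proc/<pid>/status before relying on it.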

Related

ESP32 heap corruption error when releasing allocated memory

I am currently programming the ESP32 board in C++ and I am having trouble with my DataContainer class and releasing/allocating memory.
I use the following DataContainer class (simplified):
template <typename Elementtype>
class DataContainer
{
private:
    Elementtype **datalist;
    int maxsize;
    std::size_t currentsize; // How much data is saved in datalist
public:
    DataContainer(int maxcapacity);
    ~DataContainer();
    // ...some methods...
    void reset_all_data();
};
And here is the reset_all_data() definition:
/* Deletes all data of the DataContainer and allocates new memory */
template <typename Elementtype>
void DataContainer<Elementtype>::reset_all_data()
{
    for (int i = 0; i < currentsize; i++)
    {
        if (datalist[i])
            Serial.println(heap_caps_check_integrity_all(true));
        delete datalist[i];   // <-- Error is triggered here!!!
        Serial.println(heap_caps_check_integrity_all(true));
    }
    delete datalist;
    datalist = new Elementtype *[maxsize];
    for (int i = 0; i < maxsize; i++) // Allocate a memory block of size maxsize (maxsize = 50)
    {
        datalist[i] = new Elementtype[5];
    }
    currentsize = 0;
}
As you can see, I have added some integrity checks around the delete of datalist[i], which is what seems to trigger the error. When I call reset_all_data() from my main.cpp at a certain point in my program, the following error is triggered:
CORRUPT HEAP: Bad head at 0x3ffbb0f0. Expected 0xabba1234 got 0x3ffb9a34
assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)
Backtrace: 0x40083881:0x3ffb2540 0x4008e7e5:0x3ffb2560 0x40093d55:0x3ffb2580 0x4009399b:0x3ffb26b0 0x40083d41:0x3ffb26d0
0x40093d85:0x3ffb26f0 0x4014e3f5:0x3ffb2710 0x400d2dc6:0x3ffb2730 0x400d31e3:0x3ffb2750 0x400d9b02:0x3ffb2820
One more thing: the error is only triggered when a certain function is called right before it, even when the whole body of that function is commented out. This is the function's signature:
void write_data_container_to_file(fs::FS &fs, const char *path, DataContainer<uint16_t> data, const char *RTC_timestamp)
Thus, the mere call of the function plays an important role here.
Right now I am completely lost; any suggestion or idea on how to proceed is welcome.
EDIT: The dataContainer holds a 2D array of uint16_t.
I finally tracked down the rather obvious reason for the HEAP CORRUPTION. In the end I only called delete datalist, but it would have been correct to call delete[] datalist after the for loop. Within the for loop I delete the pointers to the arrays that represent the "rows" of my allocated 2D memory; at the end I also have to delete the array holding those row pointers itself.
So I was not paying attention. When it comes to releasing previously allocated memory, care must be taken over whether delete or delete[] should be called.
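For reference, here is a sketch of how the cleanup could look with matching delete[] calls. It assumes the DataContainer layout from the question and that all maxsize rows were allocated (as the re-allocation loop suggests), so treat it as an illustration rather than the exact fix:

template <typename Elementtype>
void DataContainer<Elementtype>::reset_all_data()
{
    // The rows were allocated with new Elementtype[5], so they need delete[] as well.
    for (int i = 0; i < maxsize; i++)
        delete[] datalist[i];

    // The array of row pointers came from new Elementtype *[maxsize]: delete[] again.
    delete[] datalist;

    datalist = new Elementtype *[maxsize];
    for (int i = 0; i < maxsize; i++)
        datalist[i] = new Elementtype[5];

    currentsize = 0;
}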

Simulating (lazy) NAND memory on Windows

I'm running a firmware simulation in a DLL which has simulated NAND (256MB or 1GB). I want to avoid allocating memory for this on the heap and instead allocate using virtual memory.
The memory initially needs to be cleared to 0xFF (like NAND is). However I don't want to pay for that initialization (nor commit un-accessed pages). So ideally it should only allocate upon access. And I do not need to retain the data following exit of the simulation.
Initial ideas are:
1) VirtualAlloc. I'm not sure, but perhaps I could use a guard page and then trap the exception on first access. I'm not sure it's ideal for a DLL to handle such SEH exceptions, though. Or is there a better way?
2) Create a big file that's initialized to 0xFF, then map a view of the file with copy-on-write.
Does anyone know if it is possible to create a file with a callback for providing the initial data?
I think 1) is probably the way to go, but I'm wondering whether that's really the best option.
Edit:
3) I've come up with another method that avoids the exception handler and also avoids creating a huge file:
Create a file that is the same size as dwAllocationGranularity (typically 64 KiB) and fill it with 0xFF. Then create multiple copy-on-write views of it in contiguous memory using MapViewOfFileEx + FILE_MAP_COPY (after an initial VirtualAlloc/VirtualFree to get a suitable base address at which we can hope to place the juxtaposed views). I need to test this a bit more fully; there is a slight concern about potential thread races. I'm only actually using a single thread, but the CRT does start a few too.
This means that any code that only reads the virtual NAND also does not result in all pages getting committed.
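For what it's worth, here is a rough, untested sketch of the tiling idea in 3). It takes the liberty of using a pagefile-backed section instead of an on-disk file (one 64 KiB backing granule filled with 0xFF, mapped copy-on-write many times) and omits all error handling:

#include <windows.h>
#include <cstring>

// Map totalSize bytes of "NAND" that read as 0xFF and copy-on-write per page.
// totalSize must be a multiple of the allocation granularity.
void* map_lazy_nand(SIZE_T totalSize)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    const SIZE_T gran = si.dwAllocationGranularity;   // typically 64 KiB

    // One granule of pagefile-backed backing store, filled with 0xFF once.
    HANDLE section = CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr,
                                        PAGE_READWRITE, 0, (DWORD)gran, nullptr);
    void* init = MapViewOfFile(section, FILE_MAP_WRITE, 0, 0, gran);
    std::memset(init, 0xFF, gran);
    UnmapViewOfFile(init);

    // Reserve a contiguous range to find a base address, release it, and hope to
    // re-occupy it with the individual views (this is the race mentioned above).
    void* base = VirtualAlloc(nullptr, totalSize, MEM_RESERVE, PAGE_NOACCESS);
    VirtualFree(base, 0, MEM_RELEASE);

    for (SIZE_T off = 0; off < totalSize; off += gran)
        if (!MapViewOfFileEx(section, FILE_MAP_COPY, 0, 0, gran, (BYTE*)base + off))
            return nullptr;   // someone else grabbed part of the range

    return base;   // reads see 0xFF; the first write to a page triggers a private copy
}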
Yes, basically 1) is the best solution. I would only make the following changes: use a VEH instead of SEH - an SEH handler is invoked only for faults raised inside the guarded code, whereas a VEH catches the access from any context and any thread. And instead of using a guard page, I would initially only reserve the memory region without actually committing it. Any access to the region then raises an exception; you handle it in the VEH by committing the page and filling it with the 0xFF pattern. Demo code:
PVOID g_NandBegin;
SIZE_T g_NandSize = 0x1000000;

LONG NTAPI Vex(::PEXCEPTION_POINTERS ExceptionInfo)
{
    ::PEXCEPTION_RECORD ExceptionRecord = ExceptionInfo->ExceptionRecord;

    if (ExceptionRecord->ExceptionCode == STATUS_ACCESS_VIOLATION &&
        ExceptionRecord->NumberParameters > 1)
    {
        PVOID pv = (PVOID)ExceptionRecord->ExceptionInformation[1];

        if ((ULONG_PTR)pv - (ULONG_PTR)g_NandBegin < g_NandSize)
        {
            SIZE_T RegionSize = 1;

            if (0 <= NtAllocateVirtualMemory(NtCurrentProcess(), &pv, 0, &RegionSize, MEM_COMMIT, PAGE_READWRITE))
            {
                RtlFillMemoryUlong(pv, RegionSize, MAXULONG);
                return EXCEPTION_CONTINUE_EXECUTION;
            }
        }
    }

    return EXCEPTION_CONTINUE_SEARCH;
}

void dc()
{
    if (PVOID pv = AddVectoredExceptionHandler(TRUE, Vex))
    {
        if (g_NandBegin = VirtualAlloc(0, g_NandSize, MEM_RESERVE, PAGE_READWRITE))
        {
            ULONG seed = ~GetTickCount();
            int n = 0x100;
            do
            {
                if (*(UCHAR*)((PBYTE)g_NandBegin + (((ULONG64)RtlRandomEx(&seed) * g_NandSize) >> 32)) != 0xFF)
                {
                    __debugbreak();
                }
            } while (--n);

            VirtualFree(g_NandBegin, 0, MEM_RELEASE);
        }
        RemoveVectoredExceptionHandler(pv);
    }
}

Swift Dictionary Memory Consumption is Astronomical

Can anyone help shed some light on why the below code consumes well over 100 MB of RAM during runtime?
public struct Trie<Element : Hashable> {
    private var children: [Element:Trie<Element>]
    private var endHere : Bool

    public init() {
        children = [:]
        endHere = false
    }
    public init<S : SequenceType where S.Generator.Element == Element>(_ seq: S) {
        self.init(gen: seq.generate())
    }
    private init<G : GeneratorType where G.Element == Element>(var gen: G) {
        if let head = gen.next() {
            (children, endHere) = ([head:Trie(gen:gen)], false)
        } else {
            (children, endHere) = ([:], true)
        }
    }
    private mutating func insert<G : GeneratorType where G.Element == Element>(var gen: G) {
        if let head = gen.next() {
            let _ = children[head]?.insert(gen) ?? { children[head] = Trie(gen: gen) }()
        } else {
            endHere = true
        }
    }
    public mutating func insert<S : SequenceType where S.Generator.Element == Element>(seq: S) {
        insert(seq.generate())
    }
}

var trie = Trie<UInt32>()
for i in 0..<300000 {
    trie.insert([UInt32(i), UInt32(i+1), UInt32(i+2)])
}
Based on my calculations total memory consumption for the above data structure should be somewhere around the following...
3 * count * sizeof(Trie<UInt32>)
Or –
3 * 300,000 * 9 = 8,100,000 bytes = ~8 MB
How is it that this data structure consumes well over 100 MB during runtime?
sizeof reports only the static footprint on the stack; a Dictionary is essentially just a wrapper around a reference to its internal reference-type implementation, plus the copy-on-write support. In other words, the key-value pairs and the hash table of your dictionary are allocated on the heap, which sizeof does not cover. This applies to all the other Swift collection types as well.
In your case, you are creating three Trie values - and indirectly three dictionaries - in every one of the 300,000 iterations. I wouldn't be surprised if the 96-byte allocations mentioned by #Macmade are the minimum overhead of a dictionary (e.g. its hash bucket).
There might also be a cost related to growing storage, so you could try whether setting a minimumCapacity on the dictionary helps. On the other hand, if you do not need a divergent path generated per iteration, you may consider an indirect enum as an alternative, e.g.
public enum Trie<Element> {
    indirect case Next(Element, Trie<Element>)
    case End
}
which should use less memory.
Size of your struct is 9 bytes, not 5.
You can check it with sizeof:
let size = sizeof( Trie< UInt32 > );
Also, you iterate 300'000 times, but insert 3 values (of course, it's a trie). So that's 900'000.
Anyway, that does not explain by itself the memory consumption you are observing.
I'm not really fluent in Swift, and I don't understand your code.
Maybe there's also some error in it, making it allocate more memory than needed.
But anyway, in order to understand what's happening, you need to run your code in Instruments (Cmd-I).
On my machine, I can see 900'000 allocations of 96 bytes each by swift_slowAlloc.
That's more like it...
Why 96 bytes, assuming there's no error in your code?
Well, it might be because of the way memory is allocated for your elements.
When satisfying a request, the memory allocator may allocate more memory than requested. That may be because it needs some internal metadata, because of paging, because of alignment, ...
But even though, it seems really exaggerated, so use instruments and double check what your code is doing.

When does json-c's json_object_to_json_string release the memory?

I am using the json-c library to send a JSON object to a client, and I notice there is no native function to release the memory that json_object_to_json_string allocates. Does the library release it automatically, or do I have to free(str) to avoid a memory leak?
I tried to read its source code, but it left me none the wiser... Does anybody know?
It seems that you don't need to free it manually.
I see that this buffer comes from within the json_object (see the last line of this function):
const char* json_object_to_json_string_ext(struct json_object *jso, int flags)
{
    if (!jso)
        return "null";

    if ((!jso->_pb) && !(jso->_pb = printbuf_new()))
        return NULL;

    printbuf_reset(jso->_pb);

    if (jso->_to_json_string(jso, jso->_pb, 0, flags) < 0)
        return NULL;

    return jso->_pb->buf;
}
The delete function frees this buffer:
static void json_object_generic_delete(struct json_object* jso)
{
#ifdef REFCOUNT_DEBUG
    MC_DEBUG("json_object_delete_%s: %p\n",
             json_type_to_name(jso->o_type), jso);
    lh_table_delete(json_object_table, jso);
#endif /* REFCOUNT_DEBUG */
    printbuf_free(jso->_pb);
    free(jso);
}
It is important to understand that this buffer is only valid while the object is valid. If the object's reference count drops to 0, the string is freed along with it, and using it after that point gives unpredictable results.
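To illustrate the lifetime rule, here is a small usage sketch (the header path and exact signatures can vary slightly between json-c versions):

#include <json-c/json.h>
#include <stdio.h>

int main(void)
{
    struct json_object *obj = json_object_new_object();
    json_object_object_add(obj, "name", json_object_new_string("example"));

    // The returned string is owned by obj (its internal print buffer): do NOT free() it.
    const char *str = json_object_to_json_string(obj);
    printf("%s\n", str);

    // Dropping the last reference frees the object and, with it, the string buffer.
    json_object_put(obj);
    /* using str past this point would be a use-after-free */
    return 0;
}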

How to put my structure variable into CPU caches to eliminate main memory page access time?

It's clear that there is no explicit way, or particular system call, that lets programmers put a variable into the CPU cache.
But I think that a certain programming style or a well-designed algorithm can increase the chances that a variable ends up in the CPU caches.
Here is my example:
I want to append an 8-byte structure at the end of an array of structures of the same type, declared in the global memory region.
This is repeated for 4 million operations and takes 6 seconds, i.e. 1.5 µs per operation. I think this result indicates that the two memory areas have not been cached.
I got some clues from cache-oblivious algorithms, so I tried several ways to improve this. So far, no improvement.
I think some clever code could reduce the elapsed time by a factor of 10 to 100. Please show me the way.
-------------------------------------------------------------------------
Appended (2011-04-01)
Damon, thank you for your comment!
After reading your comment, I analyzed my code again and found several things that I had missed. The code attached below is an abbreviated version of my original code.
To accurately measure each operation's execution time (in the original code there are several different types of operations), I inserted time-measuring code using the clock_gettime() function. I thought that if I measured each operation's execution time and accumulated them, the additional cost of the main loop could be avoided.
In the original code, the time-measuring code was hidden behind a macro, so I had completely forgotten about it.
The running time of this code is almost 6 seconds, but if I remove the time-measuring calls from the main loop, it becomes 0.1 seconds.
Since the clock_gettime() function supports very high precision (up to 1 nanosecond), is executed on the basis of an independent thread, and also requires a fairly big structure, I think the function caused the memory area where the consecutive insertions are performed to be evicted from the cache.
Thank you again for your comment. For further enhancement, any suggestion for optimizing my code will be very helpful.
I think the hierarchically defined structure variable might cause unnecessary time cost, but I first want to know how much it is before I change it to more C-style code.
typedef struct t_ptr {
    uint32 isleaf :1, isNextLeaf :1, ptr :30;
    t_ptr(void) {
        isleaf = false;
        isNextLeaf = false;
        ptr = NIL;
    }
} PTR;

typedef struct t_key {
    uint32 op :1, key :31;
    t_key(void) {
        op = OP_INS;
        key = 0;
    }
} KEY;

typedef struct t_key_pair {
    KEY key;
    PTR ptr;
    t_key_pair() {
    }
    t_key_pair(KEY k, PTR p) {
        key = k;
        ptr = p;
    }
} KeyPair;

typedef struct t_op {
    KeyPair keyPair;
    uint seq;
    t_op() {
        seq = 0;
    }
} OP;

#define MAX_OP_LEN 4000000

typedef struct t_opq {
    OP ops[MAX_OP_LEN];
    int freeOffset;
    int globalSeq;
    bool queueOp(register KeyPair keyPair);
} OpQueue;

bool OpQueue::queueOp(register KeyPair keyPair) {
    bool isFull = false;
    if (freeOffset == (int) (MAX_OP_LEN - 1)) {
        isFull = true;
    }
    ops[freeOffset].keyPair = keyPair;
    ops[freeOffset].seq = globalSeq++;
    freeOffset++;
    return isFull; // return statement was missing in the original snippet
}

OpQueue opQueue;

#include <sys/time.h>
#include <time.h>   // clock_gettime
#include <stdio.h>  // printf

int main() {
    struct timespec startTime, endTime, totalTime = {0, 0};
    for (int i = 0; i < 4000000; i++) {
        clock_gettime(CLOCK_REALTIME, &startTime);
        opQueue.queueOp(KeyPair());
        clock_gettime(CLOCK_REALTIME, &endTime);
        totalTime.tv_sec += (endTime.tv_sec - startTime.tv_sec);
        totalTime.tv_nsec += (endTime.tv_nsec - startTime.tv_nsec);
    }
    printf("\n elapsed time: %lld", totalTime.tv_sec * 1000000LL + totalTime.tv_nsec / 1000L);
}
YOU don't put the structure into any cache. The CPU does that automatically for you. The CPU is even more clever than that; if you access sequential memory, it will start putting things from memory into the cache before you read them.
And really, it should be common sense that for a simple bit of code like this, the time you spend on measuring is ten times more than the time to perform the code (apparently 60 times in your case).
Since you put so much confidence in clock_gettime(): I suggest you call it five times in a row and store the results, then print the differences. There's resolution, there's precision, and there's how long it takes to return the current time, which is pretty damned long.
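As a rough sketch, something along these lines shows the cost of clock_gettime() itself by timing back-to-back calls:

#include <stdio.h>
#include <time.h>

int main()
{
    struct timespec t[5];
    for (int i = 0; i < 5; i++)
        clock_gettime(CLOCK_REALTIME, &t[i]);

    // The difference between consecutive calls approximates the overhead of one call.
    for (int i = 1; i < 5; i++) {
        long ns = (t[i].tv_sec - t[i - 1].tv_sec) * 1000000000L
                + (t[i].tv_nsec - t[i - 1].tv_nsec);
        printf("call %d -> %d: %ld ns\n", i - 1, i, ns);
    }
    return 0;
}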
I have been unable to force caching, but you can force memory to be uncacheable. If you have other large data structures, you might exclude them so that they do not pollute your caches. This can be done by specifying PAGE_NOCACHE with the Windows VirtualAllocXXX functions.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366786(v=vs.85).aspx
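A minimal sketch of that idea (Windows only; treat it as an illustration of the flag, not code from the linked page):

#include <windows.h>

int main()
{
    // Commit 1 MiB of non-cacheable memory. Accesses to it bypass the CPU caches,
    // so touching this region does not evict the data you actually care about.
    void* p = VirtualAlloc(nullptr, 1 << 20, MEM_RESERVE | MEM_COMMIT,
                           PAGE_READWRITE | PAGE_NOCACHE);
    if (p)
    {
        // ... place the large, cache-polluting data structures here ...
        VirtualFree(p, 0, MEM_RELEASE);
    }
    return 0;
}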
