Understand Linux Kernel meminfo from mem_init_print_info() - memory-management

I am trying to understand the fields printed by mem_init_print_info() in the Linux kernel (mm/page_alloc.c):
pr_info("Memory: %luK/%luK available (%luK kernel code, %luK rwdata, %luK rodata, %luK init, %luK bss, %luK reserved, %luK cma-reserved", nr_free_pages() << (PAGE_SHIFT - 10), physpages << (PAGE_SHIFT - 10), codesize >> 10, datasize >> 10, rosize >> 10, (init_data_size + init_code_size) >> 10, bss_size >> 10, (physpages - totalram_pages() - totalcma_pages) << (PAGE_SHIFT - 10), totalcma_pages << (PAGE_SHIFT - 10)
First question: why are the values shifted (left by PAGE_SHIFT - 10, or right by 10) before being printed?
Second question: what are codesize, datasize, rosize, init_data_size, init_code_size, physpages, totalram_pages and totalcma_pages?
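If I am reading it right (this is just my working assumption, please correct me), the shifts are unit conversions to KiB: a page count becomes KiB via << (PAGE_SHIFT - 10), because a page is 2^PAGE_SHIFT bytes and 1 KiB is 2^10 bytes, while a byte count becomes KiB via >> 10. A small standalone snippet with made-up numbers to illustrate what I think is going on:

#include <cstdio>

int main() {
    const unsigned long PAGE_SHIFT = 12;              // assumption: 4 KiB pages
    unsigned long free_pages = 2048;                  // hypothetical page count
    unsigned long codesize_bytes = 9 * 1024 * 1024;   // hypothetical kernel text size in bytes

    // pages -> KiB: 2048 pages * 4096 bytes/page / 1024 bytes/KiB = 8192 KiB
    std::printf("free: %luK\n", free_pages << (PAGE_SHIFT - 10));
    // bytes -> KiB: 9437184 bytes / 1024 bytes/KiB = 9216 KiB
    std::printf("kernel code: %luK\n", codesize_bytes >> 10);
    return 0;
}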

Related

ARM MMU Debugging

I have been working on a bare-metal Raspberry Pi project, and I am now attempting to initialize the memory management unit, following the documentation and examples.
When I run it on the Pi, however, nothing happens afterwards, and when I run it in QEMU under gdb, gdb either crashes or I get a Prefetch Abort as exception 3. Have I incorrectly set some properties such as shareability, or incorrectly used isb, or is there something else I have missed?
Here is my code:
pub unsafe fn init_mmu(start: usize) {
    let tcr_el1 =
        (0b10 << 30) | // 4 KB granule
        (0b10 << 28) | // TTBR1 outer shareable
        (25 << 16)   | // TTBR1 size
        (0b10 << 12) | // TTBR0 outer shareable
        25;            // TTBR0 size
    let sctlr_el1 =
        (1 << 4) | // Enforce EL0 stack alignment
        (1 << 3) | // Enforce EL1 stack alignment
        (1 << 1) | // Enforce access alignment
        1;         // Enable MMU
    // 0000_0000: nGnRnE device memory
    // 0100_0100: non-cacheable
    let mair = 0b0000_0000_0100_0100;
    let mut table = core::slice::from_raw_parts_mut(start as *mut usize, 2048);
    for i in 0..(512 - 8) {
        table[i] = (i << 9) |
            (1 << 10) | // AF
            (0 << 2)  | // MAIR index
            1;          // Block entry
    }
    for i in (512 - 8)..512 {
        table[i] = (i << 9) |
            (1 << 10) | // AF
            (1 << 2)  | // MAIR index
            1;          // Block entry
    }
    table[512] = (512 << 9) |
        (1 << 10) | // AF
        (1 << 2)  | // MAIR index
        1;          // Block entry
    table[1024] = 0x8000000000000000 | start | 3;
    table[1025] = 0x8000000000000000 | (start + 512 * 64) | 3;
    write!("mair_el1", mair);
    write!("ttbr0_el1", start + 1024 * 64);
    asm!("isb");
    write!("tcr_el1", tcr_el1);
    asm!("isb");
    write!("sctlr_el1", sctlr_el1);
    asm!("isb");
}
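For reference, this is the descriptor layout I believe I am supposed to produce for the 4 KiB granule, based on my reading of the ARMv8-A documentation. The snippet below is a standalone C++ sketch with made-up helper names, not part of the project; please correct me if the bit positions are wrong.

#include <cstdint>
#include <cstdio>

// My understanding of the AArch64 stage-1, 4 KiB-granule descriptor fields.
constexpr uint64_t DESC_VALID = 1ull << 0;   // bit 0: descriptor is valid
constexpr uint64_t DESC_TABLE = 1ull << 1;   // bit 1 set (with bit 0): table descriptor
constexpr uint64_t ATTR_AF    = 1ull << 10;  // access flag
constexpr uint64_t attr_idx(uint64_t i) { return i << 2; }  // MAIR index, bits [4:2]

// A level-2 block descriptor maps 2 MiB: the output address sits in bits [47:21],
// so the physical base must be 2 MiB aligned.
constexpr uint64_t block_2mib(uint64_t phys, uint64_t mair_index) {
    return (phys & 0x0000FFFFFFE00000ull) | attr_idx(mair_index) | ATTR_AF | DESC_VALID;
}

// A level-1 table descriptor points at a 4 KiB-aligned next-level table.
constexpr uint64_t table_desc(uint64_t next_table_phys) {
    return (next_table_phys & 0x0000FFFFFFFFF000ull) | DESC_TABLE | DESC_VALID;
}

int main() {
    // e.g. identity-map the third 2 MiB block with MAIR attribute index 1
    std::printf("block: %#lx\n", (unsigned long)block_2mib(2 * 0x200000, 1));
    std::printf("table: %#lx\n", (unsigned long)table_desc(0x80000));
    return 0;
}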

Why are OpenCL enqueue commands so time consuming?

I am trying to do live audio processing using the Intel HD Graphics GPU. In theory it should be perfect for this, but I am surprised at the cost of the enqueue commands. This looks to be a prohibiting factor, and by far the most time-consuming step.
In short, calling the enqueueXXXXX commands takes a long time. Actually doing the data copying and executing the kernel is sufficiently fast. Is this just an inherent problem with the OpenCL implementation, or am I doing something wrong?
Data copying + kernel execution takes about 10us
Calling the enqueue commands takes about 300us - 500us
The code is available at https://github.com/tblum/opencl_enqueue/blob/master/main.cpp
for (int i = 0; i < 10; ++i) {
cl::Event copyToEvent;
cl::Event copyFromEvent;
cl::Event kernelEvent;
auto t1 = Clock::now();
commandQueue.enqueueWriteBuffer(clIn, CL_FALSE, 0, 10 * 48 * sizeof(float), frameBufferIn, nullptr, &copyToEvent);
OCLdownMix.setArg(0,clIn);
OCLdownMix.setArg(1,clOut);
OCLdownMix.setArg(2,(unsigned int)480);
commandQueue.enqueueNDRangeKernel(OCLdownMix, cl::NullRange, cl::NDRange(480), cl::NDRange(48), nullptr, &kernelEvent);
commandQueue.enqueueReadBuffer(clOut, CL_FALSE, 0, 10 * 48 * sizeof(float), clResult, nullptr, &copyFromEvent);
auto t2 = Clock::now();
commandQueue.finish();
auto t3 = Clock::now();
cl_ulong copyToTime = copyToEvent.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
copyToEvent.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong kernelTime = kernelEvent.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
kernelEvent.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong copyFromTime = copyFromEvent.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
copyFromEvent.getProfilingInfo<CL_PROFILING_COMMAND_START>();
std::cout << "Enqueue: " << t2 - t1 << ", Total: " << t3 - t1 << ", GPU: " << (copyToTime+kernelTime+copyFromTime) / 1000.0 << "us"<< std::endl;
}
Output:
Enqueue: 1804us, Total: 4322us, GPU: 10.832us
Enqueue: 485us, Total: 668us, GPU: 10.666us
Enqueue: 237us, Total: 419us, GPU: 10.499us
Enqueue: 282us, Total: 474us, GPU: 10.832us
Enqueue: 345us, Total: 531us, GPU: 10.082us
Enqueue: 359us, Total: 555us, GPU: 10.915us
Enqueue: 345us, Total: 524us, GPU: 10.082us
Enqueue: 327us, Total: 504us, GPU: 10.416us
Enqueue: 363us, Total: 540us, GPU: 10.333us
Enqueue: 442us, Total: 595us, GPU: 10.916us
I found this related question: How to reduce OpenCL enqueue time/any other ideas?
But no useful answers for my situation.
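One idea I am considering, but have not tested, is to amortize the fixed per-call cost by enqueueing a batch of frames and synchronizing once. A rough sketch reusing the clIn, clOut, OCLdownMix, commandQueue, frameBufferIn and clResult objects from the code above (BATCH is a made-up parameter, and a real pipeline would of course need a separate buffer region per frame):

#include <CL/cl.hpp>   // or CL/cl2.hpp, depending on the SDK

void process_batched(cl::CommandQueue& commandQueue, cl::Kernel& OCLdownMix,
                     cl::Buffer& clIn, cl::Buffer& clOut,
                     const float* frameBufferIn, float* clResult)
{
    const int BATCH = 8;                                  // hypothetical batch size
    const size_t frameBytes = 10 * 48 * sizeof(float);

    OCLdownMix.setArg(0, clIn);
    OCLdownMix.setArg(1, clOut);
    OCLdownMix.setArg(2, (unsigned int)480);

    for (int b = 0; b < BATCH; ++b) {
        // All three commands are non-blocking; nothing waits until finish().
        // The same buffers are reused here only to keep the sketch short.
        commandQueue.enqueueWriteBuffer(clIn, CL_FALSE, 0, frameBytes, frameBufferIn);
        commandQueue.enqueueNDRangeKernel(OCLdownMix, cl::NullRange,
                                          cl::NDRange(480), cl::NDRange(48));
        commandQueue.enqueueReadBuffer(clOut, CL_FALSE, 0, frameBytes, clResult);
    }
    commandQueue.finish();   // one synchronization point for the whole batch
}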
Any help or ideas would be appreciated.
Thanks
BR Troels

SIGXCPU raised by setrlimit RLIMIT_CPU later than expected in a virtual machine

[EDIT: added MCVE in the text, clarifications]
I have the following program that sets RLIMIT_CPU to 2 seconds using setrlimit() and catches the signal. RLIMIT_CPU limits CPU time. «When the process reaches the soft limit, it is sent a SIGXCPU signal. The default action for this signal is to terminate the process. However, the signal can be caught, and the handler can return control to the main program.» (man)
The following program sets RLIMIT_CPU and a signal handler for SIGXCPU, then it generates random numbers until SIGXCPU gets raised, the signal handler simply exits the program.
test_signal.cpp
/*
* Test program for signal handling on CMS.
*
* Compile with:
* /usr/bin/g++ [-DDEBUG] -Wall -std=c++11 -O2 -pipe -static -s \
* -o test_signal test_signal.cpp
*
* The option -DDEBUG activates some debug logging in the helpers library.
*/
#include <iostream>
#include <fstream>
#include <random>
#include <chrono>
#include <iostream>
#include <unistd.h>
#include <csignal>
#include <sys/time.h>
#include <sys/resource.h>
using namespace std;
namespace helpers {
long long start_time = -1;
volatile sig_atomic_t timeout_flag = false;
unsigned const timelimit = 2; // soft limit on CPU time (in seconds)
void setup_signal(void);
void setup_time_limit(void);
static void signal_handler(int signum);
long long get_elapsed_time(void);
bool has_reached_timeout(void);
void setup(void);
}
namespace {
unsigned const minrand = 5;
unsigned const maxrand = 20;
int const numcycles = 5000000;
};
/*
* Very simple debugger, enabled at compile time with -DDEBUG.
* If enabled, it prints on stderr, otherwise it does nothing (it does not
* even evaluate the expression on its right-hand side).
*
* Main ideas taken from:
* - C++ enable/disable debug messages of std::couts on the fly
* (https://stackoverflow.com/q/3371540/2377454)
* - Standard no-op output stream
* (https://stackoverflow.com/a/11826787/2377454)
*/
#ifdef DEBUG
#define debug true
#else
#define debug false
#endif
#define debug_logger if (!debug) \
{} \
else \
cerr << "[DEBUG] helpers::"
// conversion factor between seconds and nanoseconds
#define NANOS 1000000000
// signal to handle
#define SIGNAL SIGXCPU
#define TIMELIMIT RLIMIT_CPU
/*
* This could be a function factory returning a closure over the signal-handling
* function, so that we could explicitly pass the output ofstream and close it.
* Alas, C++ supports closures only for lambdas, and at the moment we also need
* the signal-handling function to be a plain function pointer; a capturing
* lambda is a different kind of object and cannot be converted to one. See:
* - Passing lambda as function pointer
* (https://stackoverflow.com/a/28746827/2377454)
*/
void helpers::signal_handler(int signum) {
helpers::timeout_flag = true;
debug_logger << "signal_handler:\t" << "signal " << signum \
<< " received" << endl;
debug_logger << "signal_handler:\t" << "exiting after " \
<< helpers::get_elapsed_time() << " microseconds" << endl;
exit(0);
}
/*
* Set function signal_handler() as handler for SIGXCPU using sigaction. See
* - https://stackoverflow.com/q/4863420/2377454
* - https://stackoverflow.com/a/17572787/2377454
*/
void helpers::setup_signal() {
debug_logger << "set_signal:\t" << "set_signal() called" << endl;
struct sigaction new_action;
//Set the handler in the new_action struct
new_action.sa_handler = signal_handler;
// Empty sa_mask: no signal is blocked while the handler runs...
sigemptyset(&new_action.sa_mask);
// ...except SIGXCPU itself, which is blocked (deferred) while the handler runs.
sigaddset(&new_action.sa_mask, SIGNAL);
// Remove any flag from sa_flag
new_action.sa_flags = 0;
// Set new action
sigaction(SIGNAL,&new_action,NULL);
if(debug) {
struct sigaction tmp;
// read the old signal associated to SIGXCPU
sigaction(SIGNAL, NULL, &tmp);
debug_logger << "set_signal:\t" << "action.sa_handler: " \
<< tmp.sa_handler << endl;
}
return;
}
/*
* Set soft CPU time limit.
* RLIMIT_CPU sets the CPU time limit in seconds.
* See:
* - https://www.go4expert.com/articles/
* getrlimit-setrlimit-control-resources-t27477/
* - https://gist.github.com/Leporacanthicus/11086960
*/
void helpers::setup_time_limit(void) {
debug_logger << "set_limit:\t\t" << "set_limit() called" << endl;
struct rlimit limit;
if(getrlimit(TIMELIMIT, &limit) != 0) {
perror("error calling getrlimit()");
exit(EXIT_FAILURE);
}
limit.rlim_cur = helpers::timelimit;
if(setrlimit(TIMELIMIT, &limit) != 0) {
perror("error calling setrlimit()");
exit(EXIT_FAILURE);
}
if (debug) {
struct rlimit tmp;
getrlimit(TIMELIMIT, &tmp);
debug_logger << "set_limit:\t\t" << "current limit: " << tmp.rlim_cur \
<< " seconds" << endl;
}
return;
}
void helpers::setup(void) {
struct timespec start;
if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start)) {
exit(EXIT_FAILURE);
}
start_time = start.tv_sec*NANOS + start.tv_nsec;
setup_signal();
setup_time_limit();
return;
}
long long helpers::get_elapsed_time(void) {
struct timespec current;
if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &current)) {
exit(EXIT_FAILURE);
}
long long current_time = current.tv_sec*NANOS + current.tv_nsec;
long long elapsed_micro = (current_time - start_time)/1000 + \
((current_time - start_time) % 1000 >= 500);
return elapsed_micro;
}
bool helpers::has_reached_timeout(void) {
return helpers::timeout_flag;
}
int main() {
helpers::setup();
ifstream in("input.txt");
in.close();
ofstream out("output.txt");
random_device rd;
mt19937 eng(rd());
uniform_int_distribution<> distr(minrand, maxrand);
int i = 0;
while(!helpers::has_reached_timeout()) {
int nmsec;
for(int n=0; n<numcycles; n++) {
nmsec = distr(eng);
}
cout << "i: " << i << "\t- nmsec: " << nmsec << "\t- ";
out << "i: " << i << "\t- nmsec: " << nmsec << "\t- ";
cout << "program has been running for " << \
helpers::get_elapsed_time() << " microseconds" << endl;
out << "program has been running for " << \
helpers::get_elapsed_time() << " microseconds" << endl;
i++;
}
return 0;
}
I compile it as follows:
/usr/bin/g++ -DDEBUG -Wall -std=c++11 -O2 -pipe -static -s -o test_signal test_signal.cpp
On my laptop it correctly gets a SIGXCPU after 2 seconds; see the output:
$ /usr/bin/time -v ./test_signal
[DEBUG] helpers::set_signal: set_signal() called
[DEBUG] helpers::set_signal: action.sa_handler: 1
[DEBUG] helpers::set_limit: set_limit() called
[DEBUG] helpers::set_limit: current limit: 2 seconds
i: 0 - nmsec: 11 - program has been running for 150184 microseconds
i: 1 - nmsec: 18 - program has been running for 294497 microseconds
i: 2 - nmsec: 9 - program has been running for 422220 microseconds
i: 3 - nmsec: 5 - program has been running for 551882 microseconds
i: 4 - nmsec: 20 - program has been running for 685373 microseconds
i: 5 - nmsec: 16 - program has been running for 816642 microseconds
i: 6 - nmsec: 9 - program has been running for 951208 microseconds
i: 7 - nmsec: 20 - program has been running for 1085614 microseconds
i: 8 - nmsec: 20 - program has been running for 1217199 microseconds
i: 9 - nmsec: 12 - program has been running for 1350183 microseconds
i: 10 - nmsec: 17 - program has been running for 1486431 microseconds
i: 11 - nmsec: 13 - program has been running for 1619845 microseconds
i: 12 - nmsec: 20 - program has been running for 1758074 microseconds
i: 13 - nmsec: 11 - program has been running for 1895408 microseconds
[DEBUG] helpers::signal_handler: signal 24 received
[DEBUG] helpers::signal_handler: exiting after 2003326 microseconds
Command being timed: "./test_signal"
User time (seconds): 1.99
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.01
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1644
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 59
Voluntary context switches: 1
Involuntary context switches: 109
Swaps: 0
File system inputs: 0
File system outputs: 16
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
If I compile and run in a virtual machine (VirtualBox, running Ubuntu), I get this:
$ /usr/bin/time -v ./test_signal
[DEBUG] helpers::set_signal: set_signal() called
[DEBUG] helpers::set_signal: action.sa_handler: 1
[DEBUG] helpers::set_limit: set_limit() called
[DEBUG] helpers::set_limit: current limit: 2 seconds
i: 0 - nmsec: 12 - program has been running for 148651 microseconds
i: 1 - nmsec: 13 - program has been running for 280494 microseconds
i: 2 - nmsec: 7 - program has been running for 428390 microseconds
i: 3 - nmsec: 5 - program has been running for 580805 microseconds
i: 4 - nmsec: 10 - program has been running for 714362 microseconds
i: 5 - nmsec: 19 - program has been running for 846853 microseconds
i: 6 - nmsec: 20 - program has been running for 981253 microseconds
i: 7 - nmsec: 7 - program has been running for 1114686 microseconds
i: 8 - nmsec: 7 - program has been running for 1249530 microseconds
i: 9 - nmsec: 12 - program has been running for 1392096 microseconds
i: 10 - nmsec: 20 - program has been running for 1531859 microseconds
i: 11 - nmsec: 19 - program has been running for 1667021 microseconds
i: 12 - nmsec: 13 - program has been running for 1818431 microseconds
i: 13 - nmsec: 17 - program has been running for 1973182 microseconds
i: 14 - nmsec: 7 - program has been running for 2115423 microseconds
i: 15 - nmsec: 20 - program has been running for 2255140 microseconds
i: 16 - nmsec: 13 - program has been running for 2394162 microseconds
i: 17 - nmsec: 10 - program has been running for 2528274 microseconds
i: 18 - nmsec: 15 - program has been running for 2667978 microseconds
i: 19 - nmsec: 8 - program has been running for 2803725 microseconds
i: 20 - nmsec: 9 - program has been running for 2940610 microseconds
i: 21 - nmsec: 19 - program has been running for 3075349 microseconds
i: 22 - nmsec: 14 - program has been running for 3215255 microseconds
i: 23 - nmsec: 5 - program has been running for 3356515 microseconds
i: 24 - nmsec: 5 - program has been running for 3497369 microseconds
[DEBUG] helpers::signal_handler: signal 24 received
[DEBUG] helpers::signal_handler: exiting after 3503271 microseconds
Command being timed: "./test_signal"
User time (seconds): 3.50
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.52
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1636
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 59
Voluntary context switches: 0
Involuntary context switches: 106
Swaps: 0
File system inputs: 0
File system outputs: 16
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Even running the binary compiled on my laptop, the process gets killed after around 3 seconds of elapsed user time.
Any idea what could be causing this? For broader context, see this thread: https://github.com/cms-dev/cms/issues/851
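To narrow down whether the extra second comes from how the guest kernel accounts CPU time or from when the limit is enforced, I am thinking of comparing CLOCK_PROCESS_CPUTIME_ID against what getrusage() reports (which should be close to what RLIMIT_CPU counts). A small, untested sketch, separate from the program above:

#include <cstdio>
#include <ctime>
#include <sys/resource.h>
#include <sys/time.h>

int main() {
    volatile unsigned long spin = 0;
    for (;;) {
        for (int i = 0; i < 50000000; ++i) spin += i;   // burn some CPU time

        timespec ts{};
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);

        rusage ru{};
        getrusage(RUSAGE_SELF, &ru);
        double rusage_sec = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
                          + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;

        std::printf("cputime clock: %.3f s, getrusage: %.3f s\n",
                    ts.tv_sec + ts.tv_nsec / 1e9, rusage_sec);

        if (rusage_sec > 5.0)   // stop eventually if no SIGXCPU arrives
            break;
    }
    return 0;
}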

Julia: why does memory allocation happen for a loop inside a function?

In the memory allocation report from julia --track-allocation=user, most of the allocation is attributed to this function:
- function fuzzy_dot_square( v::Array{Int64, 1} )
- dot_prod = zero(Int64)
7063056168 for i::Int64 in 2:28
0 dot_prod += v[i]*(v[i] + v[i-1] + v[i+1] + v[i+28])# / 4 # no "top" pixel
- end
0 for i in 29:(28*27) # compiler should literate 28*27
0 dot_prod += v[i]*(v[i] + v[i-1] + v[i+1] + v[i-28] + v[i+28])# / 5 # all pixels
- end
0 for i in (28*27):(28*28 - 1)
0 dot_prod += v[i]*(v[i] + v[i-1] + v[i+1] + v[i-28])# / 4 # no "bottom" pixel
- end
-
0 return dot_prod
- end
-- it is a "fuzzy dot product" square of a vector, representing pixel image 28 by 28 (the known MNIST dataset of digit images).
Why does the allocation happen there?
As far as I understand, the dot_prod is the only thing to be allocated.
But the report points at the first for..
Also, I tried reproducing it in the REPL with:
v = Array{Int64,1}(1:100)
dot_prod = zero(Int64)
@allocated for i in 2:28
dot_prod += v[i]
end
and I get the following error at @allocated for ...:
ERROR: UndefVarError: dot_prod not defined
in macro expansion at ./REPL[3]:2 [inlined]
in (::##1#f#1)() at ./util.jl:256
The @time macro works fine, so is there perhaps a bug in @allocated? I have Julia 0.5.0.
This is a limitation of --track-allocation=user. There's no type instability and there are no allocations.
julia> function fuzzy_dot_square(v)
dot_prod = zero(eltype(v))
for i in 2:28
dot_prod += v[i]*(v[i] + v[i-1] + v[i+1] + v[i+28])# / 4 # no "top" pixel
end
for i in 29:(28*27) # compiler should literate 28*27
dot_prod += v[i]*(v[i] + v[i-1] + v[i+1] + v[i-28] + v[i+28])# / 5 # all pixels
end
for i in (28*27):(28*28 - 1)
dot_prod += v[i]*(v[i] + v[i-1] + v[i+1] + v[i-28])# / 4 # no "bottom" pixel
end
return dot_prod
end
fuzzy_dot_square (generic function with 1 method)
julia> const xs = [1:28^2;];
julia> @allocated fuzzy_dot_square(xs)
0
See also this passage from the Julia documentation:
In interpreting the results, there are a few important details. Under the user setting, the first line of any function directly called from the REPL will exhibit allocation due to events that happen in the REPL code itself. More significantly, JIT-compilation also adds to allocation counts, because much of Julia’s compiler is written in Julia (and compilation usually requires memory allocation). The recommended procedure is to force compilation by executing all the commands you want to analyze, then call Profile.clear_malloc_data() to reset all allocation counters. Finally, execute the desired commands and quit Julia to trigger the generation of the .mem files.
And for further information, see this Julia issue.

How do memory leak checkers like Valgrind identify free?

I want to understand how memory leak checkers identify whether a free has been called for a given malloc.
A malloc can easily be identified by brk system calls, so if I am writing a profiler and single-stepping a process that breaks at system calls, I can easily tell that a malloc has been done.
How can I find out whether a free has been called for this malloc?
Below is the output from strace. The code does call free; how can we tell from this strace output that free was invoked?
read(0, "13608\n", 4096) = 6
brk(0) = 0x8cc6000
brk(0x8ce7000) = 0x8ce7000
write(1, "File name - /proc/13608/maps\n", 29) = 29
open("/proc/13608/maps", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x55559000
read(3, "00349000-00363000 r-xp 00000000 "..., 4096) = 1046
write(1, "ptr1-ffd1f49a\n", 14) = 14
write(1, "ptr2-ffd1f4a8\n", 14) = 14
write(1, "Buffer read - 00349000-00363000 "..., 102) = 102
write(1, "\n", 1) = 1
write(1, "ptr1-ffd1f49a\n", 14) = 14
write(1, "ptr2-ffd1f4aa\n", 14) = 14
write(1, "Buffer read - 00367000-004a6000 "..., 104) = 104
write(1, "\n", 1) = 1
write(1, "ptr1-ffd1f49a\n", 14) = 14
write(1, "ptr2-ffd1f4bd\n", 14) = 14
write(1, "Buffer read - 08048000-08049000 "..., 123) = 123
write(1, "\n", 1) = 1
write(1, "ptr1-ffd1f49a\n", 14) = 14
write(1, "ptr2-ffd1f4a1\n", 14) = 14
write(1, "Buffer read - ffad8000-ffaf1000 "..., 95) = 95
write(1, "\n", 1) = 1
write(1, "ptr1-ffd1f479\n", 14) = 14
write(1, "ptr2-ffd1f479\n", 14) = 14
write(1, "Buffer read - ffffe000-fffff000 "..., 55) = 55
write(1, "\n", 1) = 1
read(3, "", 4096) = 0
close(3) = 0
munmap(0x55559000, 4096) = 0
write(1, "Starting Address - 00349000\n", 28) = 28
write(1, "Ending Address - 00363000\n", 26) = 26
write(1, "Permissions - r-xp\n", 19) = 19
write(1, "Offset - 00000000\n", 18) = 18
write(1, "PathName - </lib/ld-2.5.so>\n", 28) = 28
write(1, "\n", 1) = 1
write(1, "\n", 1) = 1
write(1, "Starting Address - 00367000\n", 28) = 28
write(1, "Ending Address - 004a6000\n", 26) = 26
write(1, "Permissions - r-xp\n", 19) = 19
write(1, "Offset - 00000000\n", 18) = 18
write(1, "PathName - </lib/libc-2.5.so>\n", 30) = 30
write(1, "\n", 1) = 1
write(1, "\n", 1) = 1
write(1, "Starting Address - 08048000\n", 28) = 28
write(1, "Ending Address - 08049000\n", 26) = 26
write(1, "Permissions - r-xp\n", 19) = 19
write(1, "Offset - 00000000\n", 18) = 18
write(1, "PathName - </fs_user/samirba/myP"..., 49) = 49
write(1, "\n", 1) = 1
write(1, "\n", 1) = 1
write(1, "Starting Address - ffad8000\n", 28) = 28
write(1, "Ending Address - ffaf1000\n", 26) = 26
write(1, "Permissions - rw-p\n", 19) = 19
write(1, "Offset - 7ffffffe6000\n", 22) = 22
write(1, "PathName - <[stack]>\n", 21) = 21
write(1, "\n", 1) = 1
write(1, "\n", 1) = 1
write(1, "Starting Address - ffffe000\n", 28) = 28
write(1, "Ending Address - fffff000\n", 26) = 26
write(1, "Permissions - r-xp\n", 19) = 19
write(1, "Offset - ffffe000\n", 18) = 18
write(1, "PathName - <EMPTY>\n", 19) = 19
write(1, "\n", 1) = 1
write(1, "\n", 1) = 1
exit_group(0) = ?
There is no one-to-one relationship between a malloc call and a system call.
Typically, a malloc library will get big blocks from the OS using e.g. the brk or mmap system calls.
These big blocks are then cut into smaller blocks to serve successive malloc calls.
A free will usually not cause a system call (e.g. munmap) to be issued.
So you cannot really track malloc and free at the system-call level.
Valgrind can track memory leaks because it intercepts (and replaces) malloc, free, etc.
The Valgrind replacement functions maintain a list of allocated blocks.
Real leaks (i.e. memory that can no longer be reached because all pointers to it have been lost or erased) are found by Valgrind with a scan of all the active memory.
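This is not how Valgrind itself is implemented, but here is a minimal sketch of the bookkeeping idea described above, with made-up wrapper names (tracked_malloc/tracked_free): record every allocation keyed by the returned pointer, erase the record on free, and whatever is left at the end is a leak.

#include <cstdio>
#include <cstdlib>
#include <unordered_map>

// Minimal leak-bookkeeping sketch (illustration only, not Valgrind's design).
static std::unordered_map<void*, size_t> live_allocations;

void* tracked_malloc(size_t size) {
    void* p = std::malloc(size);
    if (p) live_allocations[p] = size;    // remember the block and its size
    return p;
}

void tracked_free(void* p) {
    live_allocations.erase(p);            // a matching free removes the record
    std::free(p);
}

void report_leaks() {
    for (const auto& kv : live_allocations)
        std::printf("leaked %zu bytes at %p\n", kv.second, kv.first);
}

int main() {
    void* a = tracked_malloc(64);
    void* b = tracked_malloc(128);
    tracked_free(a);                      // b is never freed
    report_leaks();                       // reports the 128-byte block
    (void)b;
    return 0;
}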
AFAIK, allocated memory blocks are identified by their starting address, so look for a free() called with the same argument that was previously returned by malloc(). Since strace only logs the lower-level mmap and brk calls, use ltrace to log high-level library calls instead, keeping an eye on the return values and arguments.
