Measure script memory usage in G-WAN for each request - memory-management

How can I measure the memory usage of a G-WAN application for each request made?
Specifically, I want the memory consumed by /csp scripts and /handlers scripts.

You can use the server_report function.
Check out http://gwan.ch/source/report.c for an example.

To measure the memory consumed by a G-WAN script (either handler or servlet), you have to consider two things:
code size (see the gwan.log file, which dumps it along with an MD5 checksum)
data size (which depends on your code, so it can only be reported at runtime)
As Paulo suggested, you can track what every malloc() / calloc() / strdup(), etc. does in your code, but you will miss any memory used by G-WAN, the system, or third-party library calls.
The worker thread stack also grows dynamically when needed, so unless you know exactly what you are doing, there is no obvious way to check precisely how much memory a given script uses.
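As a rough illustration of the malloc()-tracking idea above, here is a minimal sketch in C. The names tracked_malloc, tracked_free and tracked_usage are hypothetical helpers, not part of the G-WAN API. The counter only sees allocations made through these wrappers, so it misses memory used by G-WAN itself, the system, or libraries, and as written it is not thread-safe.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helpers for illustration; not part of the G-WAN API. */

/* Live bytes currently allocated through tracked_malloc().
   Assumption: single-threaded use (a plain global is not thread-safe). */
static size_t tracked_bytes = 0;

/* Header stored in front of every block so tracked_free() knows its size.
   The max_align_t member keeps the user pointer suitably aligned. */
typedef union {
    size_t      size;
    max_align_t align;
} track_hdr;

/* Allocate n bytes and add them to the running total. */
void *tracked_malloc(size_t n)
{
    track_hdr *h = malloc(sizeof *h + n);
    if (!h)
        return NULL;
    h->size = n;
    tracked_bytes += n;
    return h + 1;               /* user data starts right after the header */
}

/* Release a block obtained from tracked_malloc() and update the total. */
void tracked_free(void *p)
{
    if (!p)
        return;
    track_hdr *h = (track_hdr *)p - 1;
    tracked_bytes -= h->size;
    free(h);
}

/* Bytes currently held through the wrappers. */
size_t tracked_usage(void)
{
    return tracked_bytes;
}
```

Reading tracked_usage() at the start and end of a request would give the net data allocated by your own code during that request, under the assumptions above.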

Related

Storing Per-Process Data in Kernel Module / Passing Data Between sys_enter and sys_exit Probe

Familiarity with how Linux Kernel Tracepoints work is not necessarily required to help with this question, it is just what is motivating this problem. In essence, I am looking for a way to store per-process data for a kernel module, without modifying the Linux source (e.g. struct task_struct), and ideally without using locks. Here is my specific question:
I have a kernel module that hooks into the sys_enter (defined here for x86_64, aarch64) and sys_exit (x86_64, aarch64) tracepoints. For each system call issued, I need to pass some data between the enter probe and the exit probe.
Some things I have considered: I could ...
...use one global variable -- but that will be shared between concurrently executing system calls on different CPUs, creating a race.
...use one global map from PID (of the process issuing the system call) to my data, together with locks -- but that will unnecessarily require synchronization between all CPUs on each system call. I would like to avoid this, since the data is "local" to each issued system call, so I feel like there should be a way to keep it local and not add costly synchronization.
...use a per-CPU global variable -- but (it is my understanding that) a process may move to another CPU during the system call execution, making this approach incorrect.
...kmallocing some memory for my custom data upon each system call entry, then pass the address to that memory by clobbering one of the registers in struct pt_regs (both the entry and exit probe receive a pointer to said struct) -- but then I will have a memory leak for system calls that do not trigger the exit probe (such as sys_exit, which never returns).
I am open to any suggestions how these ideas could be refined to address the problems I listed, or any completely different ideas that I am not thinking of.
I'd use an RCU enabled hashtable, for safety.
The first option isn't actually doable, as you stated.
The third one requires you to track which process is using which CPU, which seems unnecessary.
The leaking problem of the fourth option can probably be solved somehow, but allocating memory on each system call can introduce a serious delay.
Of course, accessing the hashtable will also slow down the system, but it won't trigger a memory allocation for each system call, so I assume it'll be less harmful.
Also, I may be wrong here, but if you assume that only process creation/destruction changes the table itself (not the data within each entry, but the location and hash value of each row), then maybe you won't even have to synchronize on each system call, only on the ones that cause process creation/destruction.

Does calling `writev` repeatedly with the same memory address allow hardware caching?

I've read some performance claims about how Elixir and Erlang use hardware, and I'm trying to see if I understand their basis. Some background:
First, Erlang supports writing nested lists of immutable strings (iolists) to IO (files, sockets, etc) and uses writev and the strings' memory addresses to do so (see Evan Miller's blog post on this).
Second, the docs for an Erlang web framework called Chicago Boss say:
Erlang Respects Your RAM!
Erlang is different from other platforms because when rendering a server-side template, it doesn't create a separate copy of a web page in memory for each connected client. Instead, it constructs pointers to the same pieces of immutable memory across multiple requests.
So if two people request two different profile pages at the same time, they're actually sent the same chunks of memory for the header, footer, and other shared template snippets. The result is a server that can construct complex, uncached web pages for hundreds of users per second without breaking a sweat.
Third, a book about an Elixir (Erlang VM) web framework called Phoenix says:
Templates are precompiled. Phoenix doesn’t need to copy strings for each rendered template. At the hardware level, you’ll see caching come into play for these strings where it never did before.
From looking at the source, I know that this framework uses iolists to represent a completed response template.
Putting all this together, I think what's being implied is that if a web framework uses writev to tell the OS to send the same header and footer strings from the same memory locations, one web request after another, the hardware will be able to say "oh, I know that value, it's already in CPU cache so I don't have to look in RAM for it."
Is that right? (I have very little understanding of system calls and hardware.) If not, any ideas on how hardware caching is involved?
(Bonus if you can tell me how to see or infer what's happening.)
Yes, it's mostly the processor caches that help you. The data takes less time to retrieve because it sits in faster memory (i.e. the CPU caches).
Some pointers for understanding what the caches are and how they work:
https://www.quora.com/How-does-the-cache-memory-in-a-computer-work
http://www.hardwaresecrets.com/how-the-cache-memory-works/
http://lwn.net/Articles/252125/
To see this, measure how long a request takes (client side) during normal server operation. Then run a separate process within the same VM that constantly creates a very large string and writes it to disk (it probably has to be megabytes in size, at least the size of the L2/L3 caches on your processor). Measure the request time again: if done correctly, it should be at least one order of magnitude slower.
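For readers unfamiliar with writev, here is a minimal C sketch of the pattern described in the question: the header and footer live at fixed addresses and are reused for every response, and only the body changes. This is an illustration of the system call, not the Erlang VM's actual implementation; send_page, HEADER and FOOTER are hypothetical names.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Shared, immutable template fragments (illustrative content only).
   Every response points at these same addresses. */
static const char HEADER[] = "<html><body>";
static const char FOOTER[] = "</body></html>";

/* Send header + per-request body + footer with a single writev() call.
   Returns the number of bytes written, or -1 on error.
   Note: writev() may write fewer bytes than requested; a real server
   would loop until everything has been sent. */
ssize_t send_page(int fd, const char *body)
{
    struct iovec iov[3];

    iov[0].iov_base = (void *)HEADER;       /* no copy: just a pointer */
    iov[0].iov_len  = sizeof HEADER - 1;    /* exclude the trailing NUL */
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);
    iov[2].iov_base = (void *)FOOTER;
    iov[2].iov_len  = sizeof FOOTER - 1;

    return writev(fd, iov, 3);
}
```

Because the kernel reads HEADER and FOOTER from the same addresses on every call, those bytes are likely to stay in the CPU caches under load, which is the effect the answer describes.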

Which memory allocation functions can be called from the interrupt environment in AIX?

When writing an AIX kernel extension, xmalloc can be used only in the process environment.
Which memory allocation functions can be called from the interrupt environment in AIX?
Thanks.
The network memory allocation routines. Look in /usr/include/net/net_malloc.h. The lowest level is net_malloc and net_free.
I don't see much documentation in IBM's pubs nor the internet. There are a few examples in various header files.
There is no public prototype that I can find for these.
If you look in net_malloc.h, you will see the MALLOC and NET_MALLOC macros, which call net_malloc. If you then grep all the files under /usr/include, you will see uses of these macros. From these uses, you can deduce the arguments to the macros and thus the arguments to net_malloc itself. I would write one pass-through routine to net_malloc whose interface you control.
On your target system, run "netstat -m". The last bucket size you see is the largest size you can request from net_malloc with the M_NOWAIT flag. M_WAIT can be used only at process time and waits for netm to allocate more memory if necessary. M_NOWAIT returns 0 if there is not enough pinned memory. At interrupt time, you must use M_NOWAIT.
There is no real checking for the "type" but it is good to pick an appropriate type for debugging purposes later on. The netm output from kdb shows the type.
In a similar fashion, you can figure out how to call net_free.
It's sad that IBM has chosen not to document this. An alternative way to get this information officially is to pay for an "ISV" question. If you are doing serious AIX development, you want to become an ISV / Partner. It will save you a lot of heartache. I don't know the cost, but it is within reach of small companies and even individuals.
This book is nice to have too.

Free memory from code like the 'purge' command

One of my apps needs a function that frees inactive/used/wired memory, just like the 'purge' command.
I have searched a lot but could not find anything.
Any comments are welcome.
Purge doesn't do what you seem to think it does. It doesn't "free inactive/used/wired memory". As the manpage says:
It does not affect anonymous memory that has been allocated through malloc, vm_allocate, etc.
All it does is purge the disk cache. This is only useful if you're running performance tests and want to simulate the effects of "first run after cold boot" without actually cold booting. Again, from the manpage:
Purge can be used to approximate initial boot conditions with a cold disk buffer cache for performance analysis.
There is no public API for this, although a quick scan of the symbols shows that it seems to call a function CPOSXPurgeAllDiskBuffers from the CoreProfile private framework. I believe the underlying kernel and userland disk cache code is all or mostly available on http://www.opensource.apple.com, so you could probably implement the same thing yourself, if you really want.
As iMysak says, you can just exec (or NSTask, etc.) the tool if you want to.
As a side note, if you could free used/wired memory, presumably that memory is used by something; even if you don't have pointers into it in your own data structures, malloc probably does. Are you trying to segfault your code?
Freeing inactive memory is a different story. Just freeing something up to malloc doesn't necessarily make malloc return it to the OS. And there's no way you can force it to. If you think about the way traditional UNIX works, it makes sense: When you ask it to allocate more memory, it uses sbrk to expand your data segment; if you free up memory at the top, it can sbrk back down, but if you free up memory in the middle, there's no way it can do that. Of course modern UNIX systems don't work that way, but the POSIX and C APIs are all designed to be compatible with systems that do. So, if you want to make sure memory gets freed, you have to handle memory allocation directly.
The simplest and most portable way to do this is to create and mmap a temporary backing file, or just MAP_ANON, and explicitly unmap pages when you're done with them. (This works on all POSIX systems—and, with a pretty simple wrapper, even Windows.) If you need even more control (e.g., to manually handle flushing pages to disk, etc.), you can use the mach/mach_vm.h APIs.
You can run it directly from the OS with the exec() function.

Limiting memory of V8 Context

I have a script server that runs arbitrary JavaScript code on our servers. At any given time multiple scripts can be running, and I would like to prevent one misbehaving script from eating up all the RAM on the machine. I could do this by running each script in its own process and having an off-the-shelf monitoring tool watch the RAM usage of each process, killing and restarting the ones that get out of hand. I don't want to do this because I would like to avoid the cost of restarting the binary every time one of these scripts goes crazy. Is there a way in V8 to set a per-context/isolate memory limit that I can use to sandbox the running scripts?
It should be easy to do now:
context.EstimatedSize() to get the estimated size of the context
isolate.TerminateExecution() when the context exceeds acceptable memory/CPU usage (or whatever limit you choose)
To regain control when there is an infinite loop (or something else blocking, like a CPU-heavy calculation), I think you could use isolate.RequestInterrupt().
A single process can run multiple isolates. If you keep a 1:1 ratio of isolates to contexts, you can easily:
restrict memory usage per isolate
get heap stats
See some examples in this commit:
https://github.com/discourse/mini_racer/commit/f7ec907547e9a6ea888b2587e4edee3766752dd3
In particular you have:
v8::HeapStatistics stats;
isolate->GetHeapStatistics(&stats);
There are also fancy features like memory allocation callbacks you can use.
This is not reliably possible.
All JavaScript contexts created by this process share the same object heap.
WebKit/Chromium tries some stuff to disable contexts after context OOMs.
http://code.google.com/searchframe#OAMlx_jo-ck/src/third_party/WebKit/Source/WebCore/bindings/v8/V8Proxy.cpp&exact_package=chromium&q=V8Proxy&type=cs&l=361
Sources:
http://code.google.com/p/v8/source/browse/trunk/src/heap.h?r=11125&spec=svn11125#280
http://code.google.com/p/chromium/issues/detail?id=40521
http://code.google.com/p/chromium/issues/detail?id=81227

Resources