I am running a few experiments to see how system behavior changes under different memory and CPU loads. I was wondering: is there a bash script which constantly uses a lot of memory but little CPU?
For the purpose of simulating CPU/memory/IO load, most *NIX systems (Linux included) provide a handy tool called stress.
The tool varies from OS to OS. On Linux, to take up 512MB of RAM with low CPU load:
stress --vm 1 --vm-bytes 512M --vm-hang 100
(The invocation means: start one memory worker (--vm 1), have it allocate 512MB (--vm-bytes 512M), and sleep for 100 seconds (--vm-hang 100) before freeing the memory.)
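One quick way to sanity-check the resulting load profile is to start stress in the background and watch memory and load from the same shell (watch and free are standard on most Linux distributions):
stress --vm 1 --vm-bytes 512M --vm-hang 100 &
watch -n 1 'free -m; uptime'
Memory usage should stay up while the load average stays low.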
This is silly, and can't reasonably be expected to provide data that will be useful in any real-world scenario. However, to generate at least the amount of memory consumption associated with a given power-of-two number of bytes:
build_string() {
  local pow=$1       # exponent: the result will be 2^pow bytes long
  local dest=$2      # name of the variable to store the result in
  local s=' '
  for (( i=0; i<pow; i++ )); do
    s+="$s"          # double the string on every pass
  done
  printf -v "$dest" %s "$s"
}
build_string 10 kilobyte # build a string of length 1024
echo "Kilobyte string consumes ${#kilobyte} bytes"
build_string 20 megabyte # build a string of length 1048576
echo "Megabyte string consumes ${#megabyte} bytes"
Note that transiently, during construction, at least 2x the requested space will be required (for the local s in addition to the destination variable); a version that avoided this would either use namevars (requiring bash 4.3) or eval (requiring a willingness to do evil).
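To tie this back to the original question (hold a chunk of memory while using essentially no CPU), a minimal sketch; the exponent and the variable name ballast are just illustrative, and building very large strings this way can take a while:
build_string 27 ballast        # roughly 128MB held by the shell
while :; do sleep 60; done     # stay alive with negligible CPU usage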
When using the tool 'free' on Linux, we can see several memory-related values:
[root@coconut-stateless-clients-5 ~] 2021-08-03 17:28:07 $ free
total used free shared buff/cache available
Mem: 62907052 382180 61985152 4788 539720 61933812
Swap: 0 0 0
I need to lower the 'free' memory value and keep the 'available' value unchanged (as much as I can).
How can I 'fill' up the cache memory at the expense of 'free' on a Linux machine?
Cache memory is filled by the kernel in various cases; usually the operating system caches binaries or other files it is currently working with. For example, data that is displayed or sent to other machines is kept in the cache.
That mechanism can be used to load files or data into the cache, achieving both goals: reducing 'free' memory and filling 'cache'.
For that we can use the standard tool 'head', which reads lines or bytes from a file.
The data can be read into memory directly, in which case it is cached only momentarily; or it can be written to a file, in which case it stays cached until the memory is needed for something else (and no other space is left).
With the help of this article you can get more familiar with the details, but the following examples are sufficient if you just want to achieve the goals.
Fill 'cache' with x GiB/MiB of data and reduce 'free' with the same space:
2GiB example:
# head -c 2G /dev/urandom > dummy.file
250MiB example:
# head -c 250M /dev/urandom > dummy.file
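To verify the effect, compare the free output before and after writing the file; 'free' should drop and 'buff/cache' should grow by roughly the amount written (1G here is just an example size):
# free -m
# head -c 1G /dev/urandom > dummy.file
# free -m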
In order to free up ALL the cached space, run this command:
# echo 3 > /proc/sys/vm/drop_caches
I'm using the new TensorFlow profiler to profile memory usage in my neural net, which I'm running on a Titan X GPU with 12GB RAM. Here's some example output when I profile my main training loop:
==================Model Analysis Report======================
node name | requested bytes | ...
Conv2DBackpropInput 10227.69MB (100.00%, 35.34%), ...
Conv2D 9679.95MB (64.66%, 33.45%), ...
Conv2DBackpropFilter 8073.89MB (31.21%, 27.90%), ...
Obviously this adds up to more than 12GB, so some of these matrices must be in main memory while others are on the GPU. I'd love to see a detailed breakdown of what variables are where at a given step. Is it possible to get more detailed information on where various parameters are stored (main or GPU memory), either with the profiler or otherwise?
"Requested bytes" shows a sum over all memory allocations, but that memory can be allocated and de-allocated. So just because "requested bytes" exceeds GPU RAM doesn't necessarily mean that memory is being transferred to CPU.
In particular, for a feedforward neural network, TF will normally keep the forward activations around to make backprop efficient, but it doesn't need to keep the intermediate backprop activations (i.e. dL/dh at each layer), so it can throw those intermediates away as soon as it's done with them. So I think in this case what you care about is the memory used by Conv2D, which is less than 12 GB.
You can also use the timeline to verify that total memory usage never exceeds 12 GB.
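Independently of the profiler, you can also watch the actual GPU memory footprint from a shell while the training loop runs; nvidia-smi ships with the NVIDIA driver, and the query below is just one way of asking for the memory figures:
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
If the reported usage stays below 12 GB, nothing is being forced off the GPU.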
I have a toy web app which is very CPU intensive:
package main

import (
    "fmt"
    "net/http"
    "time"
)

func PerfServiceHandler(w http.ResponseWriter, req *http.Request) {
    start := time.Now()
    w.Header().Set("Content-Type", "application/json")
    x := 0
    for i := 0; i < 200000000; i++ {
        x = x + 1
        x = x - 1
    }
    elapsed := time.Since(start)
    w.Write([]byte(fmt.Sprintf("Time Elapsed %s", elapsed)))
}

func main() {
    http.HandleFunc("/perf", PerfServiceHandler)
    http.ListenAndServe(":3000", nil)
}
The above function takes about 120 ms to execute. But when I load test this app with 500 concurrent users (siege -t30s -i -v -c500 http://localhost:3000/perf), the results I got were:
Average Resp Time per request 2.51 secs
Transaction Rate 160.57 transactions per second
Can someone answer my queries below?
When I ran with 100, 200, and 500 concurrent users, I saw the number of OS threads used by the app go from 7 (right after startup) to 35 and get stuck there. Increasing the number of concurrent connections does not change this: even when 500 concurrent requests arrive at the server, it stays at 35 OS threads (the app was started with runtime.GOMAXPROCS(runtime.NumCPU())). When the test stopped, the number was still 35.
Can someone explain this behaviour to me?
Can the number of OS threads be increased somehow (from the OS or from Go)?
Will performance improve if the number of OS threads is increased?
Can someone suggest some other ways of optimizing this app?
Environment:
Go - go1.4.1 linux/amd64
OS - Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u2 x86_64 GNU/Linux
Processor - 2.6GHz (Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz)
RAM - 64 GB
OS Parameters -
nproc - 32
cat /proc/sys/kernel/threads-max - 1031126
ulimit -u - 515563
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515563
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 515563
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Multiple goroutines can correspond to a single OS thread. The design is described here: https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw/edit, which references this paper: http://supertech.csail.mit.edu/papers/steal.pdf.
On to the questions:
Even when 500 concurrent requests arrive at the server, the number of OS threads is still stuck at 35 [...] Can someone explain this behaviour?
Since you set GOMAXPROCS to the number of CPUs, Go will only run that many goroutines at a time.
One thing that may be a little confusing is that goroutines aren't always running (sometimes they are blocked). For example, if you read a file, the goroutine is blocked while the OS does that work, and the scheduler will pick another goroutine to run (assuming there is one). Once the file read completes, that goroutine goes back into the list of "runnable" goroutines.
The creation of OS-level threads is handled by the scheduler, and there are additional complexities around system-level calls. (Sometimes you need a real, dedicated thread; see LockOSThread.) But you shouldn't expect a ton of threads.
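If you want to observe the thread count yourself while siege is running, you can read it straight from /proc on Linux (perfapp below is just a placeholder for your binary's name):
watch -n 1 'grep Threads /proc/$(pgrep perfapp)/status'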
Can the number of OS threads be increased somehow (from the OS or from Go)?
I think using LockOSThread may result in the creation of new threads, but it won't matter:
Will performance improve if the number of OS threads is increased?
No. Your CPU is fundamentally limited in how many things it can do at once. Goroutines work because it turns out most operations are IO-bound in some way, but if you are truly doing something CPU-bound, throwing more threads at the problem won't help. In fact it will probably make things worse, since there is overhead involved in switching between threads.
In other words, Go is making the right decision here.
Can someone suggest some other ways of optimizing this app?
for i := 0; i < 200000000; i++ {
x = x + 1
x = x - 1
}
I take it you wrote this code just to make the CPU do a lot of work? What does the actual code look like?
Your best bet will be finding a way to optimize that code so it needs less CPU time. If that's not possible (it's already highly optimized), then you will need to add more computers / CPUs to the mix. Get a better computer, or more of them.
For multiple computers you can put a load balancer in front of all your machines and that should scale pretty easily.
You may also benefit by pulling this work off of the webserver and moving it to some backend system. Consider using a work queue.
I'd like to be able to test some guesses about memory complexity of various command line utilities.
Taking as a simple example
grep pattern file
I'd like to see how memory usage varies with the size of pattern and the size of file.
For time complexity, I'd make a guess, then run
time grep pattern file
on various sized inputs to see if my guess seems to be borne out in reality, but I don't know how to do this for memory.
One possibility would be a wrapper script that initiates the job and samples memory usage periodically, but this seems inelegant and unlikely to give the real high watermark.
I've seen time -v suggested, but don't have that flag available on my machine (running bash on OSX) and don't know where to find a version that supports it.
I've also seen that on Linux this information is available through the proc filesystem, but again, it's not available to me in my context.
I'm wondering if dtrace might be an appropriate tool, but again I'm concerned that a simple sample-based figure might not be the true high watermark.
Does anyone know of a tool or approach that would be appropriate on OSX?
Edit
I removed two mentions of disk usage, which were just asides and perhaps distracted from the main thrust of the question.
Your question is interesting because, without the application source code, you need to make a few assumptions about what constitutes memory use. Even if you were to use procfs, the results will be misleading: both the resident set size and the total virtual address space will be over-estimates since they will include extraneous data such as the program text.
Particularly for small commands, it would be easier to track individual allocations, although even there you need to be sure to include all the possible sources. In addition to malloc() etc., a process can extend its heap with brk() or obtain anonymous memory using mmap().
Here's a DTrace script that traces malloc(); you can extend it to include other allocating functions. Note that it isn't suitable for multi-threaded programs as it uses some non-atomic variables.
bash-3.2# cat hwm.d
/* find the maximum outstanding allocation provided by malloc() */
size_t total, high;
pid$target::malloc:entry
{
    self->size = arg0;
}

pid$target::malloc:return
/arg1/
{
    total += self->size;
    allocation[arg1] = self->size;
    high = (total > high) ? total : high;
}

pid$target::free:entry
/allocation[arg0]/
{
    total -= allocation[arg0];
    allocation[arg0] = 0;
}

END
{
    printf("High water mark was %d bytes.\n", high);
}
bash-3.2# dtrace -x evaltime=exec -qs hwm.d -c 'grep maximum hwm.d'
/* find the maximum outstanding allocation provided by malloc() */
High water mark was 62485 bytes.
bash-3.2#
A much more comprehensive discussion of memory allocators is contained in this article by Brendan Gregg. It provides a much better answer than my own to your question. In particular, it includes a link to a script called memleak.d; modify this to include time stamps for the allocations & deallocations, so that you can sort its output by time. Then, perhaps using the accompanying script as an example, use perl to track the current outstanding total allocation and high water mark. Such a DTrace/perl combination would be suitable for tracing multi-threaded processes.
You can use /usr/bin/time -l (which is not the shell's time builtin on macOS) and read the "maximum resident set size"; this is not precisely the high water mark, but it might give you some idea.
$ /usr/bin/time -l ls
...
0.00 real 0.00 user 0.00 sys
925696 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
239 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
3 voluntary context switches
1 involuntary context switches
The meaning of this field is explained here.
Tried getrusage(). Inaccurate results. Tried Instruments. Pain in the arse.
Best solution by far: valgrind + massif.
command-line based: easy to run, script and automate; no apps to open, menus to click, blah blah; can run in background etc
provides a visual graph-- in your terminal-- of memory usage over time
valgrind --tool=massif /path/to/my_program arg1 ...
ms_print `ls -r massif.out.* | head -1` | grep Detailed -B50
To view more details, run ms_print `ls -r massif.out.* | head -1`
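By default massif only counts heap allocations made through malloc and friends; if you want every mapped page counted (closer to what the OS sees), it also supports a page-level mode, shown here as a variation rather than a requirement:
valgrind --tool=massif --pages-as-heap=yes /path/to/my_program arg1 ...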
I want to ensure that a long-running number crunching algorithm doesn't use too much memory. The algorithm is written in C++ and runs on OS X. A drastically simplified version is:
#include <vector>
using std::vector;

int main() {
    while (someCondition) {
        // nothing is explicitly allocated with new, but the vector's
        // buffer is heap-allocated and freed on every iteration
        vector<int> v(10, 0);
    }
}
I've profiled the code using Instruments (allocations and leaks). I don't see any leaks, and while the "live bytes" count looks fine (hovers around 20 MB), the "overall bytes" count keeps growing. What concerned me is that when the "overall bytes" count reached about 80 GB, I received an OS X warning about lack of hard disk space (I have a 120 GB solid-state disk). I don't know much about OS/process interaction, so I thought I'd ask:
Is memory used by a long running process on a UNIX-based OS available to other processes before the first process is killed or no longer running?
Edit: It looks like I was misinterpreting the "overall bytes" number in Instruments (see: Instruments ObjectAlloc: Explanation of Live Bytes & Overall Bytes). When I check the process in Activity Monitor, the "real memory" is essentially constant.
The reason you get a disk space warning is probably related to virtual memory allocation. Every time your process (or the OS) requests memory, it is usually first "allocated" in the backing store, i.e. swap.
Total virtual memory is the size of available swap plus RAM. I do not have access to OS X, and I know it plays by its own rules, but there must be a command that shows swap usage:
swap -l (Solaris)
swap -s (Solaris)
free (linux)
The only commands I came up with are vm_stat, plus top - it appears top is probably the closest to what I am talking about.
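For what it's worth, OS X also exposes a swap usage summary through sysctl, which pairs nicely with vm_stat:
sysctl vm.swapusage        # total / used / free swap
vm_stat                    # page-level statistics, including pageouts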