Unit of maximum resident set size in the output of time - macOS

Can someone let me know what the unit of the maximum resident set size is in the output below?
/usr/bin/time -l mvn clean package -T 7 -DskipTests
...
real 530.51
user 837.49
sys 64.28
3671834624 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
2113909 page reclaims
26733 page faults
0 swaps
5647 block input operations
26980 block output operations
15 messages sent
25 messages received
687 signals received
406533 voluntary context switches
1319461 involuntary context switches
I am trying to measure peak memory usage of a process as mentioned here.
Environment - macOS Sierra (10.12.5)

The unit of maximum resident set size in the output of macOS's /usr/bin/time -l is bytes, so the run above peaked at roughly 3.4 GiB. (GNU time on Linux reports the same counter in kilobytes.)
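If you want to double-check this programmatically, the value comes from getrusage(2)'s ru_maxrss field, which Go exposes directly. A minimal sketch (RUSAGE_SELF measures the current process rather than a child, and the MiB conversion assumes the macOS byte unit):

package main

import (
    "fmt"
    "syscall"
)

func main() {
    var ru syscall.Rusage
    if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
        panic(err)
    }
    // On macOS (and other BSDs) Maxrss is in bytes; on Linux it is in kilobytes.
    fmt.Printf("max RSS: %d (%.1f MiB assuming bytes)\n",
        ru.Maxrss, float64(ru.Maxrss)/(1<<20))
}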

Related

ImageMagick convert ultra slow

I'm trying to optimize an 8-second / 1.1 MB GIF. It takes more than 5 minutes on my Mac and doesn't finish, so I just abort.
convert ep.gif -coalesce -layers Optimize ep-optimized.gif
Resource
>>> convert -list resources
Resource limits:
Width: 461.169PP
Height: 461.169PP
Area: 17.1799GP
List length: unlimited
Memory: 8GiB
Map: 16GiB
Disk: unlimited
File: 192
Thread: 4
Throttle: 0
Time: unlimited
A frame of your video is 1644x810x3 bytes, i.e. 4MB, if your ImageMagick was compiled at Q8. You can check with:
magick identify -version
Version: ImageMagick 7.1.0-5 Q16 x86_64 2021-08-22 https://imagemagick.org
You can see mine is Q16, so each frame is now 8MB.
Your GIF has 521 frames, so your minimum RAM requirement, just to load your image and not even start creating an output image, is:
1644x810x3x2x521 = 4GB
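As a quick sanity check of that estimate, here is a minimal Go sketch using the frame dimensions and count quoted above (the Q8/Q16 factor is simply the bytes per channel):

package main

import "fmt"

func main() {
    const (
        width, height = 1644, 810 // pixels per frame
        channels      = 3         // RGB
        frames        = 521
    )
    q8 := int64(width) * height * channels * frames // 1 byte per channel at Q8
    q16 := q8 * 2                                   // 2 bytes per channel at Q16
    fmt.Printf("Q8:  %.2f GiB\n", float64(q8)/(1<<30))  // ~1.94 GiB
    fmt.Printf("Q16: %.2f GiB\n", float64(q16)/(1<<30)) // ~3.88 GiB
}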
Checking memory usage on my machine, I get:
/usr/bin/time -l magick ep.gif -coalesce -layers Optimize ep-optimized.gif
190.22 real 1568.54 user 102.18 sys
17268342784 maximum resident set size <--- 17 GB !!!
0 average shared memory size
0 average unshared data size
0 average unshared stack size
11073112 page reclaims
14 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
1 voluntary context switches
15028430 involuntary context switches
4523216162264 instructions retired
6123562866718 cycles elapsed
17278709760 peak memory footprint
I think you either need to:
recompile a Q8 version of ImageMagick, or
allocate more RAM (if you have it), or
target a lower resolution or frame-rate, or
consider using a different format - such as video.

Should the stats reported by Go's runtime.ReadMemStats approximately equal the resident set size reported by ps aux?

In Go, should the "Sys" stat, or any other stat or combination of stats reported by runtime.ReadMemStats, approximately equal the resident set size reported by ps aux?
Alternatively, assuming some memory may be swapped out, should the Sys stat be approximately greater than or equal to the RSS?
We have a long-running web service that deals with a high frequency of requests and we are finding that the RSS quickly climbs up to consume virtually all of the 64GB memory on our servers. When it hits ~85% we begin to experience considerable degradation in our response times and in how many concurrent requests we can handle. The run I've listed below is after about 20 hours of execution, and is already at 51% memory usage.
I'm trying to determine if the likely cause is a memory leak (we make some calls to CGO). The data seems to indicate that it is, but before I go down that rabbit hole I want to rule out a fundamental misunderstanding of the statistics I'm using to make that call.
This is an amd64 build targeting linux and executing on CentOS.
Reported by runtime.ReadMemStats:
Alloc: 1294777080 bytes (1234.80MB) // bytes allocated and not yet freed
Sys: 3686471104 bytes (3515.69MB) // bytes obtained from system (sum of XxxSys below)
HeapAlloc: 1294777080 bytes (1234.80MB) // bytes allocated and not yet freed (same as Alloc above)
HeapSys: 3104931840 bytes (2961.09MB) // bytes obtained from system
HeapIdle: 1672339456 bytes (1594.87MB) // bytes in idle spans
HeapInuse: 1432592384 bytes (1366.23MB) // bytes in non-idle span
Reported by ps aux:
%CPU %MEM VSZ RSS
1362 51.3 306936436 33742120
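For reference, numbers like the ones above can be collected in-process with runtime.ReadMemStats; a minimal sketch (field names are from runtime.MemStats, the labels and formatting are mine):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    // Sys is the total bytes obtained from the OS; HeapIdle minus HeapReleased
    // is memory the runtime is holding on to but not currently using.
    fmt.Printf("Alloc:     %d bytes\n", m.Alloc)
    fmt.Printf("Sys:       %d bytes\n", m.Sys)
    fmt.Printf("HeapAlloc: %d bytes\n", m.HeapAlloc)
    fmt.Printf("HeapSys:   %d bytes\n", m.HeapSys)
    fmt.Printf("HeapIdle:  %d bytes\n", m.HeapIdle)
    fmt.Printf("HeapInuse: %d bytes\n", m.HeapInuse)
}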

Ways of optimizing a CPU Intensive Golang WebApp

I have a toy web app which is very CPU-intensive:
package main

import (
    "fmt"
    "net/http"
    "time"
)

// PerfServiceHandler burns CPU in a tight loop and reports how long it took.
func PerfServiceHandler(w http.ResponseWriter, req *http.Request) {
    start := time.Now()
    w.Header().Set("Content-Type", "application/json")
    x := 0
    for i := 0; i < 200000000; i++ {
        x = x + 1
        x = x - 1
    }
    elapsed := time.Since(start)
    w.Write([]byte(fmt.Sprintf("Time Elapsed %s", elapsed)))
}

func main() {
    http.HandleFunc("/perf", PerfServiceHandler)
    http.ListenAndServe(":3000", nil)
}
The above handler takes about 120 ms to execute. But when I load test this app with 500 concurrent users (siege -t30s -i -v -c500 http://localhost:3000/perf), the results I get are:
Average Resp Time per request 2.51 secs
Transaction Rate 160.57 transactions per second
Can someone answer my queries below?
When I ran with 100, 200, and 500 concurrent users, I saw the number of OS threads used by the above app climb from 7 when the app was just started and then get stuck at 35. Increasing the number of concurrent connections does not change this number. Even when 500 concurrent requests arrive at the server, the number of OS threads is still stuck at 35 (the app was started with runtime.GOMAXPROCS(runtime.NumCPU())). When the test stopped, the number was still 35.
Can someone explain this behaviour to me?
Can the number of OS threads be increased somehow (from the OS or from Go)?
Will performance improve if the number of OS threads is increased?
Can someone suggest some other ways of optimizing this app?
Environment:
Go - go1.4.1 linux/amd64
OS - Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u2 x86_64 GNU/Linux
Processor - 2.6 GHz (Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz)
RAM - 64 GB
OS Parameters -
nproc - 32
cat /proc/sys/kernel/threads-max - 1031126
ulimit -u - 515563
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515563
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 515563
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Multiple goroutines can correspond to a single OS thread. The design is described here: https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw/edit, which references this paper: http://supertech.csail.mit.edu/papers/steal.pdf.
On to the questions:
Even when 500 concurrent requests arrive at the server the number of OS threads were still stuck at 35 OS threads [...] Can someone explain me this behaviour?
Since you set GOMAXPROCS to the number of CPUs, Go will only run that many goroutines at a time.
One thing that may be a little confusing is that goroutines aren't always running (sometimes they are blocked). For example, if you read a file, the goroutine is blocked while the OS does that work, and the scheduler will pick up another goroutine to run (assuming there is one). Once the file read is complete, that goroutine goes back into the list of "runnable" goroutines.
The creation of OS-level threads is handled by the scheduler, and there are additional complexities around system-level calls. (Sometimes you need a real, dedicated thread; see runtime.LockOSThread.) But you shouldn't expect a ton of threads.
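If you want to watch the thread count yourself, the runtime/pprof "threadcreate" profile reports how many OS threads the runtime has created; a rough sketch (the 500 goroutines and 2-second sleep are arbitrary, and the tight loops rely on the asynchronous preemption added in Go 1.14+):

package main

import (
    "fmt"
    "runtime"
    "runtime/pprof"
    "time"
)

func main() {
    runtime.GOMAXPROCS(runtime.NumCPU())
    // Start many CPU-bound goroutines; the scheduler multiplexes them onto
    // roughly GOMAXPROCS running threads rather than one thread each.
    for i := 0; i < 500; i++ {
        go func() {
            x := 0
            for {
                x++
            }
        }()
    }
    time.Sleep(2 * time.Second)
    fmt.Println("goroutines:        ", runtime.NumGoroutine())
    fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
}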
Can the no. of OS threads be increased somehow (from OS or from GOlang)?
I think using LockOSThread may result in the creation of new threads, but it won't matter:
Will this improve the performance if no. of OS threads are increased?
No. Your CPU is fundamentally limited in how many things it can do at once. Goroutines work because it turns out most operations are IO bound in some way, but if you are truly doing something CPU bound, throwing more threads at the problem won't help. In fact it will probably make it worse, since there is overhead involved in switching between threads.
In other words Go is making the right decision here.
Can someone suggest some other ways of optimizing this app?
for i := 0; i < 200000000; i++ {
x = x + 1
x = x - 1
}
I take it you wrote this code just to make the CPU do a lot of work? What does the actual code look like?
Your best bet will be finding a way to optimize that code so it needs less CPU time. If that's not possible (it's already highly optimized), then you will need to add more computers / CPUs to the mix. Get a better computer, or more of them.
For multiple computers you can put a load balancer in front of all your machines and that should scale pretty easily.
You may also benefit from pulling this work off the web server and moving it to some backend system. Consider using a work queue.
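An in-process version of the work-queue idea might look like the sketch below: handlers enqueue jobs onto a buffered channel and a fixed pool of workers does the CPU-heavy part, which bounds how much work is in flight at once (the job type, queue size, and worker count here are made up for illustration):

package main

import (
    "fmt"
    "net/http"
    "runtime"
)

// job carries a channel on which a worker sends the result back.
type job struct {
    done chan string
}

// jobs is a buffered queue between the HTTP handlers and the worker pool.
var jobs = make(chan job, 1024)

func worker() {
    for j := range jobs {
        // The CPU-heavy work lives here; each worker runs one job at a time.
        x := 0
        for i := 0; i < 200000000; i++ {
            x = x + 1
            x = x - 1
        }
        j.done <- fmt.Sprintf("done, x=%d", x)
    }
}

func handler(w http.ResponseWriter, req *http.Request) {
    j := job{done: make(chan string, 1)}
    jobs <- j
    fmt.Fprint(w, <-j.done)
}

func main() {
    // One worker per CPU keeps the cores busy without oversubscribing them.
    for i := 0; i < runtime.NumCPU(); i++ {
        go worker()
    }
    http.HandleFunc("/perf", handler)
    http.ListenAndServe(":3000", nil)
}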

How to interpret stats/show in Rebol 3

I wanted to do some profiling of an R3 script and was looking at the stats command.
But what does this information mean?
How can it be used to monitor memory usage?
>> stats/show
Series Memory Info:
node size = 16
series size = 20
5 segs = 409640 bytes - headers
4888 blks = 812448 bytes - blocks
1511 strs = 86096 bytes - byte strings
2 unis = 86016 bytes - unicode strings
4 odds = 39216 bytes - odd series
6405 used = 1023776 bytes - total used
0 free / 14075 bytes - free headers / node-space
Pool[ 0] 8B 202/ 3328: 256 ( 6%) 13 segs, 26728 total
Pool[ 1] 16B 178/ 512: 256 (34%) 2 segs, 8208 total
Pool[ 2] 32B 954/ 2560: 512 (37%) 5 segs, 81960 total
...
Pool[26] 64B 0/ 0: 128 ( 0%) 0 segs, 0 total
Pools used 654212 of 1906200 (34%)
System pool used 497664
== 1023776
It shows internal memory-management information; I'm not sure how useful it is for profiling a script.
Anyway, here are some explanations about the memory pools.
Most pools are for series (there is a dedicated pool for GOB!s, and some others if you're looking at the Atronix source code); to keep it simple, I will focus on series pools here.
Internally, a series has a header and its data, which is a chunk of contiguous memory. The header holds the width and length of the series; the data holds its actual content. In R3, series are used extensively to implement block!, port!, string!, object!, etc., so managing memory in R3 is mostly a matter of allocating and destroying series. Because series differ in width and length, pools are introduced to reduce fragmentation.
When a new series is needed, the header is allocated in a special pool, and another pool is chosen for its data: the pool whose width is closest to the size of the series. E.g. a block with 3 elements will probably be allocated in a pool with a width of 128 bytes (on 32-bit systems, a block is a series with 4 elements: 3 plus 1 terminator). As a pool can grow while the program runs, it is implemented as a list of segments; new segments are allocated and appended to the list as needed (but they are never released back to the system).
Another special pool is the system pool, which is chosen when the required memory is large. R3 doesn't actually manage this pool other than collecting some statistics.
When it collects garbage, it starts from the root context and marks everything reachable, then it goes through the series header pool, finds all unneeded series, and destroys them.
If you use stats without a refinement, you can see the actual memory usage, so by comparing memory usage before and after your implementations you can see which one uses less memory.
>> stats
== 1129824
>> s: make string! 1024
== ""
>> stats
== 1132064

Using time command for benchmarking

I'm trying to use the time command as a simple solution for benchmarking some scripts that do a lot of text processing and make a number of network calls. To evaluate if it's a good fit, I tried:
/usr/bin/time -f "\n%E elapsed,\n%U user,\n%S system, \n %P CPU, \n%M
max-mem footprint in KB, \n%t avg-mem footprint in KB, \n%K Average total
(data+stack+text) memory,\n%F major page faults, \n%I file system
inputs by the process, \n%O file system outputs by the process, \n%r
socket messages received, \n%s socket messages sent, \n%x status" yum
install nmap
and got:
1:35.15 elapsed,
3.17 user,
0.40 system,
3% CPU,
0 max-mem footprint in KB,
0 avg-mem footprint in KB,
0 Average total (data+stack+text) memory,
127 major page faults,
0 file system inputs by the process,
0 file system outputs by the process,
0 socket messages received,
0 socket messages sent,
0 status
which is not exactly what I was expecting, especially the 0 values. Even when I change the command to, say, ping google.com, the socket messages are 0. What's going on? Is there any alternative?
[And I'm not sure whether this should stay here or be posted on Server Fault.]
I think it simply doesn't work on Linux; I assume you're using Linux since you mentioned "strace". The manual page says:
Bugs
Not all resources are measured by all versions of Unix,
so some of the values might be reported as zero. The present
selection was mostly inspired by the data provided by 4.2 or
4.3BSD.
I tried "wget" on an OSX system (which is BSD-ish) to check if it report socket statistics, and there at least socket works:
0.00 user,
0.01 system,
1% CPU,
0 max-mem footprint in KB,
0 avg-mem footprint in KB,
0 Average total (data+stack+text) memory,
0 major page faults,
0 file system inputs by the process,
0 file system outputs by the process,
151 socket messages received,
8 socket messages sent,
0 status
Hope that helps,
Alex.
Do not use time to benchmark. Some of the fields of the time command are broken, as described in [1]. However, the basic functionality of time (real, user, and CPU time) is still intact.
[1] Maximum resident set size does not make sense
