ImageMagick convert ultra slow - performance

I'm trying to optimize an 8-second / 1.1 MB GIF; it takes more than 5 minutes on my Mac and doesn't finish, so I just abort.
convert ep.gif -coalesce -layers Optimize ep-optimized.gif
Resource
>>> convert -list resources
Resource limits:
Width: 461.169PP
Height: 461.169PP
Area: 17.1799GP
List length: unlimited
Memory: 8GiB
Map: 16GiB
Disk: unlimited
File: 192
Thread: 4
Throttle: 0
Time: unlimited

A single frame of your GIF is 1644x810x3 bytes, i.e. about 4 MB, if your ImageMagick was compiled at Q8 (8 bits per channel). You can check with:
magick identify -version
Version: ImageMagick 7.1.0-5 Q16 x86_64 2021-08-22 https://imagemagick.org
You can see mine is Q16 (16 bits per channel), so each frame then takes 8 MB.
Your GIF has 521 frames, so your minimum RAM requirement, just to load the image and before even starting to create an output image, is:
1644 x 810 x 3 x 2 x 521 bytes ≈ 4 GB
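If you want to verify the frame count and geometry yourself, one quick check with the same binary is:
magick identify ep.gif | head -1      # geometry and type of the first frame
magick identify ep.gif | wc -l        # number of frames in the GIF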
Checking memory usage on my machine, I get:
/usr/bin/time -l magick ep.gif -coalesce -layers Optimize ep-optimized.gif
190.22 real 1568.54 user 102.18 sys
17268342784 maximum resident set size <--- 17 GB !!!
0 average shared memory size
0 average unshared data size
0 average unshared stack size
11073112 page reclaims
14 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
1 voluntary context switches
15028430 involuntary context switches
4523216162264 instructions retired
6123562866718 cycles elapsed
17278709760 peak memory footprint
I think you either need to:
recompile a Q8 version of ImageMagick, or
allocate more RAM (if you have it), or
target a lower resolution or frame rate, or
consider using a different format, such as video (the last two are sketched below).
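For the last two options, a minimal sketch (the 50% factor, output names, and ffmpeg flags are only illustrative starting points, not tuned values):
magick ep.gif -coalesce -resize 50% -layers Optimize ep-small.gif    # halve the resolution before optimizing
ffmpeg -i ep.gif -pix_fmt yuv420p -movflags +faststart ep.mp4        # or re-encode the animation as a video
Smaller frames cut the per-frame memory quadratically, and a video codec will compress 521 similar frames far better than GIF can.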

Related

Unit of maximum resident set size in the output of time

Can someone tell me what the unit of maximum resident set size is in the output below?
/usr/bin/time -l mvn clean package -T 7 -DskipTests
...
real 530.51
user 837.49
sys 64.28
3671834624 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
2113909 page reclaims
26733 page faults
0 swaps
5647 block input operations
26980 block output operations
15 messages sent
25 messages received
687 signals received
406533 voluntary context switches
1319461 involuntary context switches
I am trying to measure peak memory usage of a process as mentioned here.
Environment - Mac OS X Sierra (10.12.5)
The unit of maximum resident set size in macOS's /usr/bin/time -l output is bytes. (On Linux, GNU time reports it in kilobytes, so the two are not directly comparable.)
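A quick way to convert the figure on macOS (any command will do in place of sleep 1; 1 MiB = 1048576 bytes):
/usr/bin/time -l sleep 1 2>&1 | awk '/maximum resident set size/ {print $1/1048576, "MiB"}'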

How to interpret stats/show in Rebol 3

I wanted to do some profiling of an R3 script and was looking at the stats command.
But what does this information mean?
How can it be used to monitor memory usage?
>> stats/show
Series Memory Info:
node size = 16
series size = 20
5 segs = 409640 bytes - headers
4888 blks = 812448 bytes - blocks
1511 strs = 86096 bytes - byte strings
2 unis = 86016 bytes - unicode strings
4 odds = 39216 bytes - odd series
6405 used = 1023776 bytes - total used
0 free / 14075 bytes - free headers / node-space
Pool[ 0] 8B 202/ 3328: 256 ( 6%) 13 segs, 26728 total
Pool[ 1] 16B 178/ 512: 256 (34%) 2 segs, 8208 total
Pool[ 2] 32B 954/ 2560: 512 (37%) 5 segs, 81960 total
...
Pool[26] 64B 0/ 0: 128 ( 0%) 0 segs, 0 total
Pools used 654212 of 1906200 (34%)
System pool used 497664
== 1023776
It shows internal memory-management information; I'm not sure how useful it is for profiling a script.
Anyway, here are some explanations about the memory pools.
Most pools are for series (there is a dedicated pool for GOB!s, and some others if you're looking at the Atronix source code); to keep it simple, I will focus on series pools here.
Internally, a series has a header and its data, which is a chunk of contiguous memory. The header holds the width and length info about the series. The data holds the actual content of the series. In R3, series are used extensively to implement block!, port!, string!, object!, etc., so managing memory in R3 is almost entirely managing (allocating and destroying) series. Because series differ in width and length, pools are introduced to reduce fragmentation.
When a new series is needed, the header is allocated in a special pool, and another pool is chosen for its data: the pool whose width is closest to the size of the series. E.g. a block with 3 elements will probably be allocated in a pool with a width of 128 bytes (on 32-bit systems, a block is a series with 4 (3 + 1 terminator) elements). As a pool can grow while the program runs, it is implemented as a list of segments; new segments are allocated and appended to the list as needed (but they are never released back to the system).
Another special pool is the system pool, which is chosen when the required memory is big. R3 doesn't actually manage this pool other than collecting some statistics.
When it tries to collect garbage, it sweeps the root context and marks everything that is reachable, then it goes through the series header pool, finds all unneeded series, and destroys them.
If you use stats without a refinement, you can see the actual memory usage. So by comparing memory usage before and after your implementations, you can see which one uses less memory. For example:
>> stats
== 1129824
>> s: make string! 1024
== ""
>> stats
== 1132064
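In this example the new string costs 1132064 - 1129824 = 2240 bytes: roughly the storage for the 1024-character buffer plus the series header and pool rounding overhead.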

Unenhanced performance of MATLAB GPU computing

With the intention of comparing the speed of GPU vs CPU computing, I ran the example code available here (a Mandelbrot set on the GPU) from MATLAB Central. Below are the results that I obtained:
Case 1 (without GPU): 6.2 secs
Case 2 (using parallel.gpu.GPUArray): 6.518 secs (1.39 secs in the example)
Case 3 (Using Element-wise Operation): 1.259 secs (0.14 secs in the example)
As can be seen, there is no improvement in case 2 and only a slight improvement of around 4 times in case 3. As the example did not state the details of the GPU used, may I know whether this is simply due to the "incompetency" of my graphics card or whether I am missing something important?
The graphics card is also responsible for driving my display (HP Z Display Z23i 23-inch IPS LED Backlit Monitor).
CPU: Intel i7-4790, 3.6 GHz (8 cores)
GPU:
Name: 'NVS 510'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.6934e+09
MultiprocessorCount: 1
ClockRateKHz: 797000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Thank you!
Edit
The GPU used in the example here is a Tesla C2050. (Credit to @Sam Roberts)
The times on that link are most likely for a different GPU than yours. They don't specify what kind of graphics card they're using, but my guess is that they're using a higher-end card.
Googling the NVS 510, its specs are similar to the card I have in my machine. However, your card is geared towards business while mine is geared towards gaming. I have a GTX 660, which is one of the higher-end GPUs available on the market.
These are the attributes of my graphics card:
CUDADevice with properties:
Name: 'GeForce GTX 660'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6.5000
ToolkitVersion: 5.5000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.5357e+09
MultiprocessorCount: 5
ClockRateKHz: 1084500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
The differences between my card and yours are that I have 5 multiprocessors, and my clock rate is about 300 MHz faster than yours. For a side-by-side comparison, check out my card in comparison to yours:
NVS 510: http://www.nvidia.ca/object/nvs-510-graphics-card.html#pdpContent=2
GTX 660: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-660/specifications
Upon further inspection, I have a much higher memory bandwidth than your card. I also have 960 GPU cores in comparison to your 192.
I decided to run these tests to compare my performance with your timings. My CPU is an i7-4770 3.6 GHz Intel and I have 16 GB of RAM on my machine.
The times that I get by running those examples are the following:
Case #1 - Without GPU: 6.46 seconds
Case #2 - Naive GPU: 0.82 seconds - 7.9x faster
Case #3 - Through CUDA: 0.09 seconds - 71.7x faster
With this, my guess is that your graphics card is simply lower-end than the one used in the tests that MathWorks performed. You could try updating your graphics drivers and see if that helps, but most likely my better performance comes down to the higher multiprocessor count, the faster clock, the larger number of cores, and the higher memory bandwidth.

Calculating CPU Performance in MIPS

I was taking an exam earlier and memorized the questions that I didn't know how to answer but somehow got correct (the online exam, through an electronic classroom (eclass), was multiple choice; it was coded so each of us was given random questions, in random order, with the answer choices shuffled).
Anyway, back to my questions:
1.)
There is a CPU with a clock frequency of 1 GHz. When the instructions consist of two
types as shown in the table below, what is the performance in MIPS of the CPU?
                Execution time (clocks)    Frequency of appearance (%)
Instruction 1            10                            60
Instruction 2            15                            40
Answer: 125
2.)
There is a hard disk drive with specifications shown below. When a record of 15
Kbytes is processed, which of the following is the average access time in milliseconds?
Here, the record is stored in one track.
[Specifications]
Capacity: 25 Kbytes/track
Rotation speed: 2,400 revolutions/minute
Average seek time: 10 milliseconds
Answer: 37.5
3.)
Assume a magnetic disk has a rotational speed of 5,000 rpm, and an average seek time of 20 ms. The recording capacity of one track on this disk is 15,000 bytes. What is the average access time (in milliseconds) required in order to transfer one 4,000-byte block of data?
Answer: 29.2
4.)
When a color image is stored in video memory at a tonal resolution of 24 bits per pixel,
approximately how many megabytes (MB) are required to display the image on the
screen with a resolution of 1024 x 768 pixels? Here, 1 MB is 10^6 bytes.
Answer: 18.9
5.)
When a microprocessor works at a clock speed of 200 MHz and the average CPI
(“cycles per instruction” or “clocks per instruction”) is 4, how long does it take to
execute one instruction on average?
Answer: 20 nanoseconds
I don't expect someone to answer everything (they are indeed already answered), but I am wondering and want to know how those answers were arrived at. It's not enough for me to know the answer; I've tried solving them myself, trial-and-error style, to arrive at those numbers, but it takes minutes to hours, so I need some professional help.
1.)
n = 1/f = 1 / 1 GHz = 1 ns.
n*10 * 0.6 + n*15 * 0.4 = 12 ns (=average instruction time) = 83.3 MIPS.
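In other words, MIPS = clock frequency / (average clocks per instruction x 10^6) = 10^9 / (12 x 10^6) ≈ 83.3.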
2.)3.)
I don't get these, honestly.
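If it helps, both printed answers are consistent with the usual seek time + average rotational latency + transfer time model:
2.) One revolution at 2,400 rpm takes 60,000 / 2,400 = 25 ms, so the average rotational latency is 12.5 ms. Reading a 15-Kbyte record from a 25-Kbyte track takes (15 / 25) * 25 ms = 15 ms. Total: 10 + 12.5 + 15 = 37.5 ms.
3.) One revolution at 5,000 rpm takes 60,000 / 5,000 = 12 ms, so the average rotational latency is 6 ms. Transferring 4,000 of the 15,000 bytes on a track takes (4,000 / 15,000) * 12 ms = 3.2 ms. Total: 20 + 6 + 3.2 = 29.2 ms.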
4.)
Here, 1 MB is 10^6 bytes.
3 Bytes * 1024 * 768 = 2359296 Bytes = 2.36 MB
But often these 24 bits are packed into 32 bits because of the memory layout (word width), so often it will be 4 Bytes * 1024 * 768 = 3145728 Bytes = 3.15 MB.
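Note that 1024 * 768 * 24 = 18,874,368, so the printed answer of 18.9 only works out if the result is counted in bits (18.9 x 10^6 bits) rather than bytes.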
5.)
CPI / f = 4 / 200 MHz = 20 ns.

Using time command for benchmarking

I'm trying to use the time command as a simple solution for benchmarking some scripts that do a lot of text processing and make a number of network calls. To evaluate whether it's a good fit, I tried:
/usr/bin/time -f "\n%E elapsed,\n%U user,\n%S system, \n %P CPU, \n%M
max-mem footprint in KB, \n%t avg-mem footprint in KB, \n%K Average total
(data+stack+text) memory,\n%F major page faults, \n%I file system
inputs by the process, \n%O file system outputs by the process, \n%r
socket messages received, \n%s socket messages sent, \n%x status" yum
install nmap
and got:
1:35.15 elapsed,
3.17 user,
0.40 system,
3% CPU,
0 max-mem footprint in KB,
0 avg-mem footprint in KB,
0 Average total (data+stack+text) memory,
127 major page faults,
0 file system inputs by the process,
0 file system outputs by the process,
0 socket messages received,
0 socket messages sent,
0 status
which is not exactly what I was expecting, especially the 0 values. Even when I change the command to, say, ping google.com, the socket messages are 0. What's going on? Is there any alternative?
[And I'm not sure whether this should stay here or be posted on Server Fault.]
I don't think this works on Linux; I assume you're using Linux since you mentioned "strace". The manual page says:
Bugs
Not all resources are measured by all versions of Unix,
so some of the values might be reported as zero. The present
selection was mostly inspired by the data provided by 4.2 or
4.3BSD.
I tried "wget" on an OS X system (which is BSD-ish) to check whether it reports socket statistics, and there at least the socket counters work:
0.00 user,
0.01 system,
1% CPU,
0 max-mem footprint in KB,
0 avg-mem footprint in KB,
0 Average total (data+stack+text) memory,
0 major page faults,
0 file system inputs by the process,
0 file system outputs by the process,
151 socket messages received,
8 socket messages sent,
0 status
Hope that helps,
Alex.
Do not use time to benchmark. Some of the fields of the time command are broken, as described in [1]. However, the basic functionality of time (real, user, and CPU time) is still intact.
[1] Maximum resident set size does not make sense
