On my machine, Kotlin uses over 200MB of memory just to start up:
$ /usr/bin/time --format "%M kB" kotlin-1.7.0 -expression 0
0
222908 kB
This is the worst memory footprint of all the languages I have tested (including Julia, which I thought was pretty bad). For comparison:
$ /usr/bin/time --format "%M kB" julia-1.7 -e0
152796 kB
$ /usr/bin/time --format "%M kB" scala-3.1.2 zero.jar
71684 kB
$ /usr/bin/time --format "%M kB" racket-7.2 -e 0
67280 kB
$ /usr/bin/time --format "%M kB" nodejs-10.21.0 -e 0
34752 kB
$ /usr/bin/time --format "%M kB" java Zero // java 17.0.2
33432 kB
$ /usr/bin/time --format "%M kB" python-3.7.3 -c0
9248 kB
$ /usr/bin/time --format "%M kB" ./zero.kexe // kotlin-1.7.0 native
2172 kB
Am I doing something wrong? Should I be launching kotlin apps using the java command? If so, what is the correct classpath to use?
Note: some of the tests above use source files which I haven't included; they simply return 0 from the main function, or an equivalent no-op.
As suggested by other commenters, when I precompile to a jar the memory footprint goes down to 45MB (executed with kotlin) or 36MB (executed with java). So effectively I was measuring the memory footprint of the compiler.
Aside: I also found an article describing how to launch kotlin classes from the command line with the correct classpath. This way, I can avoid the same mistake again by running kotlin apps with java instead of the multipurpose kotlin tool.
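For reference, a minimal sketch of that workflow, assuming a file Zero.kt containing a main function (the stdlib path in the second variant is just an example and depends on your installation):
# Bundle the Kotlin runtime into the jar, then launch with plain java
kotlinc Zero.kt -include-runtime -d zero.jar
/usr/bin/time --format "%M kB" java -jar zero.jar
# Or keep the jar thin and put kotlin-stdlib on the classpath yourself
kotlinc Zero.kt -d zero-thin.jar
java -cp "zero-thin.jar:/usr/share/kotlin/lib/kotlin-stdlib.jar" ZeroKt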
It has been mentioned that when an x86_64 Docker image is used on an M1 Mac because no ARM64 image is available, the container will start under qemu emulation for compatibility (at the cost of performance).
Oftentimes, when I'm running a collection of containers (and integration tests against the lot), I'll see qemu-system-aarch64 pegging a few cores.
My question: How can I determine, for a given list of running containers (ie. docker ps), which ones are running natively and which are being emulated?
This is also true for Docker running on an amd64 CPU when the image is built for arm64; the whole mechanism is explained in this SO post.
The emulation mechanism uses the information in the ELF header to recognize the architecture for which the process was built; if the architecture of the binary differs from the architecture of the CPU, it starts the qemu emulation. Although recognizing the architecture is really a per-process matter, there is still information about the targeted architecture of the Docker image. The targeted architecture is determined from the "Architecture" flag on the image, which was set when the image was built. Any container that runs the image will be associated (through the image) with this flag.
It should be noted that the "Architecture" flag on the image will not prevent a single process inside the image that is compiled for a different architecture than the flagged one from running. The reason is that binfmt_misc (the underlying mechanism inside the Linux kernel) will always try to recognize the architecture from the magic numbers of the ELF and will start the emulation whenever the magic number is recognized.
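As a quick check of those registrations (a sketch, assuming a Linux host with qemu registered via binfmt_misc; on Docker Desktop this lives inside the embedded VM, and the entry name varies with the emulated architecture):
# list the registered handlers
ls /proc/sys/fs/binfmt_misc/
# e.g. qemu-aarch64 on an amd64 host; shows the ELF magic numbers and the interpreter used
cat /proc/sys/fs/binfmt_misc/qemu-aarch64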
To list the architecture of the containers, you can use the following "quick" query:
for i in `docker ps --format "{{.Image}}"` ; do docker image inspect $i --format "$i -> {{.Architecture}} : {{.Os}}" ;done
The command will print the image name, the architecture, and the OS of each running container's image.
To avoid typing this command multiple times, you can add an alias to .bashrc as follows:
alias docker-arch-ps='for i in `docker ps --format "{{.Image}}"` ; do docker image inspect $i --format "$i -> {{.Architecture}} : {{.Os}}" ;done';
After this, you can simply use docker-arch-ps to get the list of the containers and their architecture.
As an improvement on @jordanvrtanoski's answer, I've written two additional commands:
docker-ps-arch:
#!/bin/bash
OPT="$*"  # pass any options (e.g. -a) through to docker container ls
set -euo pipefail
docker container ls $OPT --format "{{.ID}}\t{{.Image}}\t{{.Command}}\t{{.Status}}\t{{.Names}}" |
awk -F '\t' 'BEGIN {OFS=FS} { "docker image inspect --format \"{{.Os}}/{{.Architecture}}\" "$2" #"NR | getline $6; print }' |
column --table --table-columns "CONTAINER ID,IMAGE,COMMAND,STATUS,NAME,ARCH" -o ' ' -s $'\t'
and
docker-images-arch:
#!/bin/bash
OPT="$*"  # pass any options through to docker image ls
set -euo pipefail
docker image ls $OPT --format "{{.Repository}}\t{{.Tag}}\t{{.ID}}\t{{.Size}}" |
awk -F '\t' 'BEGIN {OFS=FS} { "docker image inspect --format \"{{.Os}}/{{.Architecture}}\" "$3" #"NR | getline $5; print }' |
column --table --table-columns "REPOSITORY,TAG,IMAGE ID,SIZE,ARCH" -o ' ' -s $'\t'
They produce output close to that of the original commands and support the options of docker container ls and docker image ls.
$ docker-ps-arch -a
CONTAINER ID IMAGE COMMAND STATUS NAME ARCH
261767e38db2 hello-world "/hello" Exited (0) About an hour ago practical_moore linux/amd64
16e364572d08 18e5af790473 "/hello" Exited (0) 3 hours ago peaceful_lalande linux/arm64
PS: the column command used here is the one from util-linux, not the one from the BSD utils. util-linux is a standard package distributed by the Linux Kernel Organization. On macOS you can get it with brew install util-linux; Rocky Linux ships it by default, and unfortunately Debian/Ubuntu made the opposite choice (cf https://askubuntu.com/q/1098248).
I have a huge job list (a few million entries) and want to run a Java-based tool to perform the feature comparison. The tool completes one calculation in:
real 0m0.179s
user 0m0.005s
sys 0m0.000s
Running on 5 nodes (each with 72 CPUs) under the PBS Torque scheduler with GNU Parallel, the tool runs fine and produces results, but since I set 72 jobs per node it should run 72 x 5 jobs at a time, yet I can see only 25-35 jobs running!
Checking CPU utilization on each node also shows that it is low.
I want to run 72 x 5 jobs or more at a time and produce the results by utilizing all the available resources (72 x 5 CPUs).
As mentioned, I have ~200 million jobs to run, and I would like to complete them faster (in 1-2 hours) by using/increasing the number of nodes/CPUs.
Current code, input and job state:
example.lst (it has ~300 million lines)
ZNF512-xxxx_2_N-THRA-xxtx_2_N
ZNF512-xxxx_2_N-THRA-xxtx_3_N
ZNF512-xxxx_2_N-THRA-xxtx_4_N
.......
cat job_script.sh
#!/bin/bash
#PBS -l nodes=5:ppn=72
#PBS -N job01
#PBS -j oe
#work dir
export WDIR=/shared/data/work_dir
cd $WDIR;
# use available 72 cpu in each node
export JOBS_PER_NODE=72
#gnu parallel command
parallelrun="parallel -j $JOBS_PER_NODE --slf $PBS_NODEFILE --wd $WDIR --joblog process.log --resume"
$parallelrun -a example.lst sh run_script.sh {}
cat run_script.sh
#!/bin/bash
# parallel command options
i=$1
data=/shared/TF_data
# create tmp dir and work in
TMP_DIR=/shared/data/work_dir/$i
mkdir -p $TMP_DIR
cd $TMP_DIR/
# get file name
mk=$(echo "$i" | cut -d- -f1-2)
nk=$(echo "$i" | cut -d- -f3-6)
#run a tool to compare the features of pair files
/shared/software/tool_v2.1/tool -s1 $data/inf_tf/$mk -s1cf $data/features/$mk-cf -s1ss $data/features/$mk-ss -s2 $data/inf_tf/$nk.pdb -s2cf $data/features/$nk-cf.pdb -s2ss $data/features/$nk-ss.pdb > $data/$i.out
# move output files
mv matrix.txt $data/glosa_tf/matrix/$mk"_"$nk.txt
mv ali_struct.pdb $data/glosa_tf/aligned/$nk"_"$mk.pdb
# move back and remove tmp dir
cd $TMP_DIR/../
rm -rf $TMP_DIR
exit 0
PBS submission
qsub job_script.sh
Logging in to one of the nodes: ssh ip-172-31-9-208
top - 09:28:03 up 15 min, 1 user, load average: 14.77, 13.44, 8.08
Tasks: 928 total, 1 running, 434 sleeping, 0 stopped, 166 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 98.4%id, 1.4%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 193694612k total, 1811200k used, 191883412k free, 94680k buffers
Swap: 0k total, 0k used, 0k free, 707960k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15348 ec2-user 20 0 16028 2820 1820 R 0.3 0.0 0:00.10 top
15621 ec2-user 20 0 169m 7584 6684 S 0.3 0.0 0:00.01 ssh
15625 ec2-user 20 0 171m 7472 6552 S 0.3 0.0 0:00.01 ssh
15626 ec2-user 20 0 126m 3924 3492 S 0.3 0.0 0:00.01 perl
.....
top on all of the nodes shows a similar state, and results are produced with only ~26 jobs running at a time!
I'm using aws-parallelcluster with 5 nodes (each with 72 CPUs), the Torque scheduler, and GNU Parallel 2018 (Mar 2018).
Update
Introducing the new function that takes input on stdin and running the script in parallel works great and utilizes all the CPUs on the local machine.
However, when it runs over remote machines it produces:
parallel: Error: test.lst is neither a file nor a block device
MCVE:
A simple script that just echoes the list gives the same error when run on remote machines, but works great on the local machine:
cat test.lst # contains list
DNMT3L-5yx2B_1_N-DNMT3L-5yx2B_2_N
DNMT3L-5yx2B_1_N-DNMT3L-6brrC_3_N
DNMT3L-5yx2B_1_N-DNMT3L-6f57B_2_N
DNMT3L-5yx2B_1_N-DNMT3L-6f57C_2_N
DNMT3L-5yx2B_1_N-DUX4-6e8cA_4_N
DNMT3L-5yx2B_1_N-E2F8-4yo2A_3_P
DNMT3L-5yx2B_1_N-E2F8-4yo2A_6_N
DNMT3L-5yx2B_1_N-EBF3-3n50A_2_N
DNMT3L-5yx2B_1_N-ELK4-1k6oA_3_N
DNMT3L-5yx2B_1_N-EPAS1-1p97A_1_N
cat test_job.sh # GNU parallel submission script
#!/bin/bash
#PBS -l nodes=1:ppn=72
#PBS -N test
#PBS -k oe
# introduce new function and Run from ~/
dowork() {
parallel sh test_work.sh {}
}
export -f dowork
parallel -a test.lst --env dowork --pipepart --slf $PBS_NODEFILE --block -10 dowork
cat test_work.sh # run/work script
#!/bin/bash
i=$1
data=$(pwd)
#create temporary folder in current dir
TMP_DIR=$data/$i
mkdir -p $TMP_DIR
cd $TMP_DIR/
# split list
mk=$(echo "$i" | cut -d- -f1-2)
nk=$(echo "$i" | cut -d- -f3-6)
# echo list and save in echo_test.out
echo $mk, $nk >> $data/echo_test.out
cd $TMP_DIR/../
rm -rf $TMP_DIR
From your timing:
real 0m0.179s
user 0m0.005s
sys 0m0.000s
it seems the tool uses very little CPU power. When GNU Parallel runs local jobs it has an overhead of 10 ms CPU time per job. Your jobs take 179 ms of wall-clock time and 5 ms of CPU time, so GNU Parallel's overhead accounts for quite a bit of the time spent.
The overhead is much worse when running jobs remotely. Here we are talking 10 ms + running an ssh command. This can easily be in the order of 100 ms.
So how can we minimize the number of ssh commands, and how can we spread the overhead over multiple cores?
First let us make a function that can take input on stdin and run the script - one job per CPU thread in parallel:
dowork() {
[...set variables here. That becomes particularly important when we run remotely...]
parallel sh run_script.sh {}
}
export -f dowork
Test that this actually works by running:
head -n 1000 example.lst | dowork
Then let us look at running jobs locally. This can be done similarly to what is described here: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround
parallel -a example.lst --pipepart --block -10 dowork
This will split example.lst into 10 blocks per CPU thread, so on a machine with 72 CPU threads this makes 720 blocks. It will then start 72 doworks, and when one is done it will get another of the 720 blocks. The reason I chose 10 instead of 1 is that if one of the jobs "gets stuck" for a while, you are unlikely to notice it.
This should make sure 100% of the CPUs on the local machine are busy.
If that works, we need to distribute this work to remote machines:
parallel -j1 -a example.lst --env dowork --pipepart --slf $PBS_NODEFILE --block -10 dowork
This should start 10 ssh invocations per CPU thread in total (i.e. 5*72*10) - namely one for each block - with 1 running per server listed in $PBS_NODEFILE at a time.
Unfortunately this means that --joblog and --resume will not work. There is currently no way to make that work, but if it is valuable to you, contact me via parallel@gnu.org.
I am not sure what tool does. But if the copying takes most of the time, and if tool only reads the files, then you might just be able to symlink the files into $TMP_DIR instead of copying them.
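A hypothetical sketch of what that could look like inside run_script.sh, assuming the tool can read its inputs from the working directory (the paths just mirror the question's layout):
# create links instead of copies; the tool then reads them like local files
ln -s "$data/inf_tf/$mk" "$TMP_DIR/"
ln -s "$data/features/$mk-cf" "$TMP_DIR/"
ln -s "$data/features/$mk-ss" "$TMP_DIR/"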
A good indication of whether you can do it faster is to look at top on the 5 machines in the cluster. If they are all using all cores at >90% then you cannot expect to get it faster.
I have a text file (Input.txt) containing about 35 million domains.
#Input.txt
google.com
cnn.com
bbc.com
........
Now I have a Python script to check the status code of each and every domain in the text file (Input.txt). For a smaller set, I do:
for i in $(cat Input.txt);do python status_check.py $i;done > out_file.txt
If I process them in this manner, it might take ages to check the status codes for all 35 million domains.
I'm not familiar with parallel processing. Can someone help me with how to achieve this task while saving time, using shell/bash/anything?
You are looking for GNU Parallel:
cat Input.txt | parallel -j 100 python status_check.py > out_file.txt
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time.
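A toy illustration of that scheduling (my own example, not from the original answer): 32 short jobs on 4 slots; parallel starts the next job as soon as a slot frees up, rather than pre-assigning 8 jobs per CPU.
# 4 job slots; a new job is launched the moment one of them finishes
seq 32 | parallel -j 4 'sleep 1; echo "job {} done"'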
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
Put an ampersand after your $1 and it will run each "concurrently"
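For reference, a minimal sketch of that ampersand approach (my own illustration; note that it forks everything at once with no throttling, which will overwhelm the machine on 35 million lines):
while read -r domain; do
  python status_check.py "$domain" </dev/null &   # </dev/null keeps the child from eating the loop's stdin
done < Input.txt > out_file.txt
wait  # wait for every background job to finish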
Bash is probably not the right tool to do this. Each fork is very expensive resource-wise. You'd be better off using Ruby or Python, reading this into an array and then processing it inside the interpreter's VM.
Why not alter your python script to read the URLs itself and then distribute the processing?
It seems a bit pointless having a bash for-loop when you could just do that in python.
There are a number of modules in python for handling parallel processing listed here.
I have an md5sum file containing lots of lines. I want to use GNU Parallel to accelerate the md5sum checking process. When md5sum gets no file argument, it reads the checksum lines from stdin. I tried this:
cat checksums.md5 | parallel md5sum -c {}
But getting this error:
md5sum 445350b414a8031d9dd6b1e68a6f2367 testing.gz: No such file or directory
How can I parallelize the md5sum checking?
Assuming checksums.md5 has the format:
d41d8cd98f00b204e9800998ecf8427e My file name
Run:
cat checksums.md5 | parallel --pipe -N1 md5sum -c
If your files are small: -N100
If that does not speed up your processing make sure your disks are fast enough: md5sum can process 500 MB/s. iostat -dkx 1 can tell you if your disks are a bottleneck.
You need the --pipe option. In this mode parallel splits stdin into blocks and supplies each block to the command via stdin; see man parallel for details:
cat checksums.md5 | parallel --pipe md5sum -c -
By default the block size is 1 MB; it can be changed with the --block option.
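For example, a larger block reduces the number of md5sum invocations (a sketch; 10M is an arbitrary size):
cat checksums.md5 | parallel --pipe --block 10M md5sum -c -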
On macOS (Mavericks), I am writing a shell script to gather transfer stats over time for the dd command.
The manual page says:
If dd receives a SIGINFO (see the status argument for stty(1)) signal,
the current input and output block counts will be written to the
standard error output in the same format as the standard completion
message.
Therefore, just like in Linux, I tried:
kill -INFO <pid_of_dd>
The command completes successfully with status 0; however, in the terminal that the dd process is connected to, no stats information appears on standard output/standard error.
So what is the correct way to get dd to print stats in its output?
You can also press Ctrl+T in the Terminal tab to get the same behavior:
MacBook-Pro:~ $ dd if=~/source_image.dmg of=/dev/disk1
load: 0.87 cmd: dd 7229 uninterruptible 0.21u 3.91s
265809+0 records in
265808+0 records out
136093696 bytes transferred in 131.170628 secs (1037532 bytes/sec)
load: 0.99 cmd: dd 7229 uninterruptible 0.32u 5.89s
415769+0 records in
415768+0 records out
212873216 bytes transferred in 203.357068 secs (1046795 bytes/sec)
It seems to work for me:
$ dd if=/dev/zero of=/dev/null bs=1k &
[1] 33990
$ kill -INFO 33990
4787784+0 records in
4787784+0 records out
4902690816 bytes transferred in 4.260769 secs (1150658706 bytes/sec)
$ kill -INFO 33990
8357846+0 records in
8357846+0 records out
8558434304 bytes transferred in 7.428820 secs (1152058392 bytes/sec)
$ kill 33990
$ ps
PID TTY TIME CMD
1342 ttys000 0:00.02 -bash
2290 ttys001 0:00.17 -bash
[1]+ Terminated: 15 dd if=/dev/zero of=/dev/null bs=1k
$
I also found via commandlinefu that you can do:
killall -INFO dd
If you had to run sudo dd to start dd you might try:
sudo killall -INFO dd
Also, I started dd in the background with nohup, so when I ran sudo killall -INFO dd and got nothing back in the output, I had to remember to go and look at the nohup.out file, because that is where the response was logged.
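A small sketch of that nohup case (paths and sizes are illustrative): dd writes its SIGINFO stats to stderr, which nohup redirects into nohup.out.
sudo nohup dd if=/dev/zero of=/dev/null bs=1k &
sudo killall -INFO dd
tail nohup.out   # the block counts and transfer rate end up here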
Worked great on OS X Mavericks.
You can press Ctrl+T while the dd command is running or, to have a nice progress bar, you can install pv (pipe viewer) with homebrew:
brew install pv
and then place pv in between:
dd if=diskimage.img | pv | dd of=/dev/disk2
example output 1
18MB 0:00:11 [1.70MiB/s] [ <=> ]
(with size of transferred data, elapsed time and speed)
Progress bar and ETA
You can also input the size of the image (16GB in this example) to get:
dd if=diskimage.img | pv -s 16G | dd of=/dev/disk2
example output 2 (also with a progress bar and estimated time):
1.61GiB 0:12:19 [2.82MiB/s] [===> ] 10% ETA 1:50:25