Speed of read from standard in versus file in Matlab Compiled code

I have the following code to read from standard in, in my compiled Matlab code, which I compiled using mcc -m -T link:exe -N -v -R -nojvm -d build test1.m -o test1
function test1()
    % First, read input stream
    tic;
    stdin=char(0);
    while ~strcmp(stdin,'EOF')
    %while ~isempty(stdin)
        stdin=input(char(0),'s');
    end
    toc;
    % Now, read file
    tic;
    fid=fopen('test1read.txt');
    tline=fgetl(fid);
    while ischar(tline)
        tline=fgetl(fid);
        strcmp(tline,'EOF');
    end
    fclose(fid);
    toc;
end
I executed it as type test1stream.txt | test1.exe (on Windows).
For a file of about 1.5 MB, I get this timing output:
Elapsed time is 1.000616 seconds.
Elapsed time is 0.156772 seconds.
I was a bit surprised to see that reading from standard input is slower than reading from a file.
For a file of about 1.5 GB (longer and more lines), I get these timings:
Elapsed time is 78.877386 seconds.
Elapsed time is 95.457972 seconds.
My first question would be: why is that? I think one reason is that input() is doing more things than necessary. Is this assumption correct?
My second question is: can I speed up the read from input in a stand-alone tool?
See also: Using standard io stream:stdin and stdout in a matlab exe

Related

How to make squeue display time limits in hours only?

When viewing submitted jobs managed by Slurm, I would like to have the time limit column (specified by %l) to show only hours, instead of the usual days-hours:minutes:seconds format. This is the command I am currently using:
squeue --format="%.6i %.5P %.25j %.8u %.8T %.10M %.5l %.15b %.5C %.6D %R" --sort=+i --me
and this is the example output:
276350 qgpu jobname username RUNNING 1:14:14 1-00:00:00 gres:gpu:v100:1 18 1 s31n02
So, in this case, I would like the elapsed time to remain as is (1:14:14), but the time limit to change from 1-00:00:00 to 24. Is there a way to do it?

This is how Slurm displays time values. The elapsed time will also switch to the days-hours:minutes:seconds format once it passes 23:59:59.
You can use a wrapper script to convert into a different format. Or if you know the time limit is no more than a day, just set the time limit to 23:59:00 by using --time=1439.
salloc -N1 --time=1439 bash
Using your squeue command:
166 mypartition interactive jyvet RUNNING 7:36 23:59:00 N/A 1 1 mynode
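A wrapper of the kind mentioned above could look roughly like this. It is only a sketch: the function name slurm_to_hours is made up, it assumes the time strings have the shapes squeue prints (days-hours:minutes:seconds, hours:minutes:seconds or minutes:seconds), and it rounds down to whole hours.

```shell
# Convert a Slurm time string to whole hours (rounded down).
# slurm_to_hours is a hypothetical helper, not part of Slurm.
slurm_to_hours() {
    printf '%s\n' "$1" | awk '
        # Pass non-numeric values such as UNLIMITED or NOT_SET through unchanged.
        !/^[0-9:-]+$/ { print; next }
        {
            days = 0; t = $0
            # Split off the optional "days-" prefix.
            if (index(t, "-")) { split(t, d, "-"); days = d[1]; t = d[2] }
            n = split(t, f, ":")
            if (n == 3)      secs = f[1] * 3600 + f[2] * 60 + f[3]   # h:m:s
            else if (n == 2) secs = f[1] * 60 + f[2]                 # m:s
            else             secs = f[1] * 60                        # bare minutes
            print days * 24 + int(secs / 3600)
        }'
}

slurm_to_hours 1-00:00:00
```

The call at the end prints 24. To use it on real output, one option is to print the job id and the %l column on their own (for example with squeue --me --noheader --format="%i %l") and pass each limit through the function.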

IntMap traverseWithKey faster than mapWithKey?

I was reading this part of Parallel and Concurrent Programming in Haskell, and found the sequential version of the program to be far slower than the parallel one with one core:
Sequential:
$ cabal run fwsparse -O2 --ghc-options="-rtsopts" -- 1000 800 +RTS -s
...
Total time 11.469s ( 11.529s elapsed)
Parallel:
$ cabal run fwsparse1 -O2 --ghc-options="-rtsopts -threaded" -- 1000 800 +RTS -s -N1
...
Total time 4.906s ( 4.988s elapsed)
According to the book, the sequential one should be slightly faster than the parallel one (if it runs on a single core).
The only difference in the two programs was this function:
Sequential:
update g k = Map.mapWithKey shortmap g
Parallel:
update g k = runPar $ do
m <- Map.traverseWithKey (\i jmap -> spawn (return (shortmap i jmap))) g
traverse get m
Since the spawn function uses deepseq, I initially thought it had something to do with strictness, but using force didn't change the performance of the sequential program.
Finally, I managed to get the sequential one to run faster:
$ cabal run fwsparse -O2 --ghc-options="-rtsopts" -- 1000 800 +RTS -s
...
Total time 3.891s ( 3.901s elapsed)
by changing the update function to this:
update g k = runIdentity $ Map.traverseWithKey (\i jmap -> pure $ shortmap i jmap) g
Why does using traverseWithKey in the Identity monad speed up performance? I checked the IntMap source code but couldn't figure out the reason.
Is this a bug or the expected behaviour?
(Also, this is my first question on StackOverflow so please tell me if I'm doing anything wrong.)
EDIT:
As per Ismor's comment, I turned off optimization (using -O0) and the sequential program runs in 19.5 seconds with either traverseWithKey or mapWithKey, while the parallel one runs in 21.5 seconds.
Why is traverseWithKey optimized to be so much faster than mapWithKey?
Which function should I use in practice?

parallel computing in multiple cores for data which is independently run with the program

I have a simulation program in Fortran that takes its input from a .dat file. This file has 100,000 lines, so it takes a really long time to run. The program takes the first line, runs all the simulations, writes the result to a .out file and moves on to the next line. I have a computer with 16 CPUs, so how can I split my data into 16 parts and run each part separately on its own CPU? I am running on an Ubuntu machine. Each line is totally independent of the others.
For example, my data is HeadData10000.dat, and I have a file simulation.ini containing the name of the input data (in this case HeadData10000.dat) and the name of the output data. So the file simulation.ini looks like this:
HeadData10000.dat
outputdata.out
Right now I have two computers, so I split my HeadData10000.dat into two files, make one simulation.ini for each input file, and run it like this on each computer: ./simulation.exe<./simulation.ini.

Assuming your list of 100,000 jobs is called "jobs.txt" and looks like this:
JobA
JobB
JobC
JobD
You could run this:
parallel 'printf "{}\n{.}.out" | ./simulation.exe' < jobs.txt
If you want to do a dry run to see what that would do without doing anything:
parallel --dry-run 'printf "{}\n{.}.out" | ./simulation.exe' < jobs.txt
Sample Output
printf "JobA\nJobA.out" | ./simulation.exe
printf "JobB\nJobB.out" | ./simulation.exe
printf "JobC\nJobC.out" | ./simulation.exe
printf "JobD\nJobD.out" | ./simulation.exe
If you have multiple servers available, look at using the -S parameter to GNU Parallel to spread the jobs across the machines. Also, look at the --eta and --bar parameters for getting progress reports.
I used printf "line1 \n line2" to generate two lines of input in order to avoid having to create, and later delete 100,000 files.
By default, GNU Parallel will keep 1 job per CPU core running, so there will always be 16 jobs running on your 16-core machine, but you can change that to, say, 8 if you want to with parallel -j 8. You can also specify the number of jobs to run on your second (and subsequent) machines.
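If GNU Parallel is not an option, the manual approach from the question (one input file and one simulation.ini per run) can also be scripted. The following is a rough sketch, assuming GNU split from coreutils (as shipped with Ubuntu) and that simulation.exe reads the input and output file names from standard input as described in the question. The chunk_ prefix and the function name make_chunks are made up for illustration.

```shell
# Split an input file into N pieces on line boundaries and write one
# two-line .ini file (input name, output name) per piece.
# make_chunks and the chunk_ prefix are hypothetical names.
make_chunks() {
    local input="$1"
    local parts="${2:-16}"
    local chunk
    # l/N: N pieces without splitting any line; -d: numeric suffixes 00, 01, ...
    split -n "l/$parts" -d --additional-suffix=.dat "$input" chunk_
    for chunk in chunk_*.dat; do
        printf '%s\n%s\n' "$chunk" "${chunk%.dat}.out" > "${chunk%.dat}.ini"
    done
}

# Example run (commented out because it needs the real program and data):
# make_chunks HeadData10000.dat 16
# for ini in chunk_*.ini; do ./simulation.exe < "$ini" & done
# wait
# cat chunk_*.out > outputdata.out
```

Note that split -n l/16 balances the pieces by size in bytes rather than by line count, so the pieces may not contain exactly the same number of lines.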

How do I improve the performance of a read-write intensive imagemagick script?

I use a bash script to process a bunch of images for a timelapse movie. The method is called shutter drag, and I am creating a moving average over all images. The following script works fine:
#! /bin/bash
totnum=10000
seqnum=40
skip=1
num=$(((totnum-seqnum)/1))
i=1
j=1
while [ $i -le $num ]; do
    echo $i
    i1=$i
    i2=$((i+1))
    i3=$((i+2))
    i4=$((i+3))
    i5=$((i+4))
    ...
    i37=$((i+36))
    i38=$((i+37))
    i39=$((i+38))
    i40=$((i+39))
    convert $i1.jpg $i2.jpg $i3.jpg $i4.jpg $i5.jpg ... \
        $i37.jpg $i38.jpg $i39.jpg $i40.jpg \
        -evaluate-sequence mean ~/timelapse/Images/Shutterdrag/$j.jpg
    i=$((i+$skip))
    j=$((j+1))
done
However, I noticed that this script takes a very long time to process a lot of images with a large averaging window (1s per image). I guess this is caused by a lot of reading and writing in the background.
Is it possible to increase the speed of this script? For example, by storing the images in memory and, with every iteration, deleting only the first image and loading only the last one.
I discovered the mpr:{label} function of imagemagick, but I guess this is not the right approach, as the memory is cleared after the convert command?

Suggestion 1 - RAMdisk
If you want to put all your files on a RAMdisk before you start, it should help the I/O speed enormously.
So, to make a 1GB RAMdisk, use:
sudo mkdir /RAMdisk
sudo mount -t tmpfs -o size=1024m tmpfs /RAMdisk
Suggestion 2 - Use MPC format
So, assuming you have done the previous step, convert all your JPEGs to MPC format files on the RAMdisk. The MPC file can be dma'ed straight into memory without your CPU needing to do costly JPEG decoding as MPC is just the same format as ImageMagick uses in memory, but on-disk.
I would do that with GNU Parallel like this:
parallel -X mogrify -path /RAMdisk -format MPC ::: *.jpg
The -X passes as many files as possible to mogrify without creating loads of convert processes. The -path says where the output files must go. The -format MPC makes mogrify convert the input files to MPC format (Magick Pixel Cache) files which your subsequent convert commands in the loop can read by pure DMA rather than expensive JPEG decoding.
If you don't have, or don't like, GNU Parallel, just omit the leading parallel -X and the :::.
Suggestion 3 - Use GNU Parallel
You could also run @chepner's code in parallel...
for ...; do
    echo convert ...
done | parallel
Essentially, I am echoing all the commands instead of running them and the list of echoed commands is then run by GNU Parallel. This could be especially useful if you cannot compile ImageMagick with OpenMP as Eric suggested.
You can play around with switches such as --eta after parallel to see how long it will take to finish, or --progress. Also, experiment with -j 2 or -j 4 depending on how big your machine is.
I did some benchmarks, just for fun. First, I made 250 JPEG images of random noise at 640x480, and ran chepner's code "as-is" - that took 2 minutes 27 seconds.
Then, I used the same set of images, but changed the loop to this:
for ((i=1, j=1; i <= num; i+=skip, j+=1)); do
    echo convert "${files[@]:i:seqnum}" -evaluate-sequence mean ~/timelapse/Images/Shutterdrag/$j.jpg
done | parallel
The time went down to 35 seconds.
Then I put the loop back how it was, and changed all the input files to MPC instead of JPEG, the time went down to 36 seconds.
Finally, I used MPC format and GNU Parallel as above and the time dropped to 19 seconds.
I didn't use a RAMdisk as I am on a different OS from you (and have extremely fast NVMe disks), but that should help you enormously too. You could write your output files to RAMdisk too, and also in MPC format.
Good luck and let us know how you get on please!

There is nothing you can do in bash to speed this up; everything except the actual IO that convert has to do is pretty trivial. However, you can simplify the script greatly:
#! /bin/bash
totnum=10000
seqnum=40
skip=1
num=$(((totnum-seqnum)/1))
# Could use files=(*.jpg), but they probably won't be sorted correctly
for ((i=1; i<=totnum; i++)); do
    files+=("$i.jpg")
done
for ((i=1, j=1; i <= num; i+=skip, j+=1)); do
    convert "${files[@]:i:seqnum}" -evaluate-sequence mean ~/timelapse/Images/Shutterdrag/$j.jpg
done
Storing the files in a RAM disk would certainly help, but that's beyond the scope of this site. (Of course, if you have enough RAM, the OS should probably be keeping a file in disk cache after it is read the first time so that subsequent reads are much faster without having to preload a RAM disk.)
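Bash array slices take a zero-based offset, so which file a window starts with depends on how the array was filled. A small sketch for checking what a given slice selects, assuming files named 1.jpg, 2.jpg, ... stored from index 0 (the helper name window is made up):

```shell
#!/bin/bash
# Build the list 1.jpg .. 10.jpg in numeric order (index 0 holds 1.jpg).
files=()
for ((i = 1; i <= 10; i++)); do
    files+=("$i.jpg")
done

# Print the window of $2 files whose first file is number $1.
# Offset first-1 is used because the array is zero-based.
window() {
    local first=$1 len=$2
    echo "${files[@]:first-1:len}"
}

window 1 3
window 8 3
```

This prints "1.jpg 2.jpg 3.jpg" and then "8.jpg 9.jpg 10.jpg". Replacing echo with the convert call gives one averaged frame per window.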

fastest hashing in a unix environment?

I need to examine the output of a certain script 1000s of times on a unix platform and check if any of it has changed from before.
I've been doing this:
(script_stuff) | md5sum
and storing this value. I actually don't really need "md5", JUST a simple hash function which I can compare against a stored value to see if it's changed. It's okay if there is an occasional false positive.
Is there anything better than md5sum that works faster and generates a fairly usable hash value? The script itself generates a few lines of text - maybe 10-20 on average to max 100 or so.
I had a look at fast md5sum on millions of strings in bash/ubuntu - that's wonderful, but I can't compile a new program. Need a system utility... :(
Additional "background" details:
I've been asked to monitor the DNS record of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to do a dig xyz +short statement and hash its output and store that, and then check it against a previously stored value. Any change will trigger the other script, otherwise it just goes on. Right now, we're planning on using cron for a set of these 1000, but can think completely differently for "seriously heavy" usage - ~20,000 or so.
I have no idea what the use of such a system would be, I'm just doing this as a job for someone else...

The cksum utility calculates a non-cryptographic CRC checksum.
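As a minimal sketch of how that might be used here: cksum prints a CRC and a byte count for its input, and that pair can be stored and compared like the md5sum value. The helper name checksum_of and the printf commands standing in for the real script are assumptions for illustration.

```shell
# Run a command and print the CRC and byte count of its output.
# checksum_of is a hypothetical helper name.
checksum_of() {
    "$@" | cksum
}

# Two stand-in "script outputs" that differ in one line.
previous=$(checksum_of printf 'line one\nline two\n')
current=$(checksum_of printf 'line one\nline 2\n')

if [ "$current" != "$previous" ]; then
    echo "output changed"
fi
```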

How big is the output you're checking? A hundred lines max. I'd just save the entire original file then use cmp to see if it's changed. Given that a hash calculation will have to read every byte anyway, the only way you'll get an advantage from a checksum type calculation is if the cost of doing it is less than reading two files of that size.
And cmp won't give you any false positives or negatives :-)
pax> echo hello >qq1.txt
pax> echo goodbye >qq2.txt
pax> cp qq1.txt qq3.txt
pax> cmp qq1.txt qq2.txt >/dev/null
pax> echo $?
1
pax> cmp qq1.txt qq3.txt >/dev/null
pax> echo $?
0
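Wrapped into a helper, the same cmp check could be sketched like this. The names changed and STATE_DIR are invented for the example, and a key that has never been seen before is treated as a change.

```shell
# Hypothetical directory holding the last seen output, one file per key.
STATE_DIR=${STATE_DIR:-./state}

# Read new output on stdin. Return 0 and replace the stored copy if it
# differs from the stored copy for key $1 (or none exists yet);
# return 1 if it is identical.
changed() {
    local key="$1"
    local new
    mkdir -p "$STATE_DIR" || return 2
    new=$(mktemp "$STATE_DIR/$key.XXXXXX") || return 2
    cat > "$new"
    if [ -f "$STATE_DIR/$key" ] && cmp -s "$STATE_DIR/$key" "$new"; then
        rm -f "$new"
        return 1
    fi
    mv "$new" "$STATE_DIR/$key"
}
```

It could then be used as dig example.com +short | changed example.com && ./on_change.sh example.com, where on_change.sh stands for whatever script should be triggered. If the name server returns records in a rotating order, piping through sort first may avoid false alarms.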
Based on your question update:
I've been asked to monitor the DNS record of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to do a dig xyz +short statement and hash its output and store that, and then check it against a previously stored value. Any change will trigger the other script, otherwise it just goes on. Right now, we're planning on using cron for a set of these 1000, but can think completely differently for "seriously heavy" usage - ~20,000 or so.
I'm not sure you need to worry too much about the file I/O. The following script executed dig microsoft.com +short 5000 times first with file I/O then with output to /dev/null (by changing the comments).
#!/bin/bash
rm -rf qqtemp
mkdir qqtemp
((i = 0))
while [[ $i -ne 5000 ]] ; do
#dig microsoft.com +short >qqtemp/microsoft.com.$i
dig microsoft.com +short >/dev/null
((i = i + 1))
done
The elapsed times at 5 runs each are:
File I/O | /dev/null
----------+-----------
3:09 | 1:52
2:54 | 2:33
2:43 | 3:04
2:49 | 2:38
2:33 | 3:08
After removing the outliers and averaging, the results are 2:49 for the file I/O and 2:45 for the /dev/null. The time difference is four seconds for 5000 iterations, only 1/1250th of a second per item.
However, since an iteration over the 5000 takes up to three minutes, that's how long it will take maximum to detect a problem (a minute and a half on average). If that's not acceptable, you need to move away from bash to another tool.
Given that a single dig only takes about 0.012 seconds, you should theoretically do 5000 in sixty seconds assuming your checking tool takes no time at all. You may be better off doing something like this in Perl and using an associative array to store the output from dig.
Perl's semi-compiled nature means that it will probably run substantially faster than a bash script and Perl's fancy stuff will make the job a lot easier. However, you're unlikely to get that 60-second time much lower just because that's how long it takes to run the dig commands.
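The associative-array idea is not tied to Perl. For comparison, here is a sketch of the same bookkeeping in bash (version 4 or later), which keeps the last output per domain in memory instead of in files. This swaps in bash for the suggested Perl; the names last_seen and record_changed are made up.

```shell
#!/bin/bash
# Last seen output per domain, kept in memory (needs bash 4 or later).
declare -A last_seen

# Return 0 and remember the value if it is new or different for key $1;
# return 1 if it matches what was seen last time.
record_changed() {
    local key=$1 value=$2
    if [[ -n ${last_seen[$key]+set} && ${last_seen[$key]} == "$value" ]]; then
        return 1
    fi
    last_seen[$key]=$value
}

# Example use inside a long-running loop (commented out, needs dig):
# record_changed "$domain" "$(dig "$domain" +short)" && ./on_change.sh "$domain"
```

The function has to be called in the current shell, not at the end of a pipeline, otherwise the updated array is lost when the subshell exits. This only helps for a script that stays running between checks; a cron job that starts fresh each time still needs the stored files.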
