I need to process a large number of data files with gnuplot in order to produce images that are collected into a movie. As the procedure is time consuming, I would like to produce the frames in parallel, and a small message should be printed from time to time to inform the user of the progress.
I tried the makefile approach:
SOURCES=$(wildcard ./*.in)
OBJECTS=$(SOURCES:.in=.out)
all: $(OBJECTS)
%.out: %.in
./worker.sh $< $@
where worker.sh is:
#!/bin/sh
gnuplot << EOF
set some_gnuplot_options
set output "$2"
plot "$1"
EOF
But:
I cannot print the progress messages,
I would prefer a single-file solution (I have not succeeded in putting the content of worker.sh directly in the makefile),
This solution introduces quite a lot of overhead compared with a single gnuplot script that contains all the instructions.
Probably the definitive solution would be a nice C++ interface to gnuplot, but I don't know the existing ones very well and I'm not sure how to do the job. Any other ideas? Please avoid suggesting new or uncommon programs like GNU parallel, as I cannot install them on some of the machines I use.
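For reference, the recipe itself can print a progress message and drive gnuplot inline via its -e option, which removes the separate worker.sh; a sketch assuming GNU make (the plot options are placeholders):
SOURCES=$(wildcard ./*.in)
OBJECTS=$(SOURCES:.in=.out)
all: $(OBJECTS)
# recipe lines must start with a TAB in a real makefile
%.out: %.in
	@echo "plotting $< -> $@"
	@gnuplot -e 'set term png; set output "$@"; plot "$<"'
Running make -j4 then produces frames in parallel, and the echo lines act as the progress messages.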
From your comment it sounds as if you are allowed to use your own scripts. GNU Parallel can be used as a script and does not need to be installed, and you can then create a file parallel_plotter:
#!/home/tange/bin/parallel --shebang-wrap -v A={} /usr/bin/gnuplot
name=system("echo $A")
set term png
set output name.".png"
plot sin(x*name)/x
Substitute /home/tange/bin/parallel with the full path to where you put the script parallel.
Then:
chmod 755 parallel_plotter
./parallel_plotter 1 2 3 4 5
This will print a line for each completed run.
To avoid the full path to /home/tange/bin/parallel I can come up with this solution:
#!/usr/bin/env gnuplot
name=system("echo $A")
set term png
set output name.".png"
plot sin(x*name)/x
Then:
chmod 755 parallel_plotter
parallel -v A={} ./parallel_plotter ::: 1 2 3 4 5
You are worried that spawning gnuplot will give a lot of overhead. I tested the above with:
./parallel_plotter {1..1000}
That took 10 secs. So the overhead of starting gnuplot on my system is less than 100 ms per job.
I'm trying to set up a script that will create empty .txt files, 24 MB in size, in the /tmp/ directory. The idea behind this script is that Zabbix, a monitoring service, will notice that the directory is full and wipe it completely by means of a recovery expression.
However, I'm new to Linux and seem to be stuck on the script that generates the files. This is what I've currently written out.
today="$( date +¨%Y%m%d" )"
number=0
while test -e ¨$today$suffix.txt¨; do
(( ++number ))
suffix=¨$( printf -- %02d ¨$number¨ )
done
fname=¨$today$suffix.txt¨
printf ´Will use ¨%s¨ as filename\n´ ¨$fname¨
printf -c 24m /tmp/testf > ¨$fname¨
I think what I'm doing wrong has to do with the printf command. But some input, advice, and/or directions to a scripting guide are very welcome.
Many thanks,
Melanchole
I guess that it doesn't matter which bytes actually end up in that file, as long as it fills up the temp dir. For that reason, the right tool to create the file is dd, which is available in every Linux distribution, often installed by default.
Check the manpage for the different options, but the most important ones are (see the sketch after this list):
if: the input file; /dev/zero is the usual choice, which is just an endless stream of zero bytes
of: the output file; you can keep the code you have to generate the name
count: the number of blocks to copy; just use 24 here
bs: the size of each block; use 1MB for that
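Putting those together (a sketch; $fname is the variable from your script that holds the target filename):
# write 24 blocks of 1 MB each from /dev/zero into the target file
dd if=/dev/zero of="$fname" bs=1MB count=24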
This is my very first post on Stack Overflow, and I should probably point out that I am EXTREMELY new to a lot of programming. I'm currently a postgraduate student doing projects that involve a lot of coding in various programs, everything from LaTeX to bash, MATLAB, etc.
If you could explain your answers explicitly, that would be much appreciated, as I'm trying to learn as I go. I apologise if there is an answer elsewhere that does what I'm trying to do, but I have spent a couple of days looking now.
So to the problem I'm trying to solve: I'm currently using a selection of bioinformatics tools to analyse a range of genomes, and I'm trying to somewhat automate the process.
I have a few sequences with names that look like this, for instance (currently all contained in folders of their own, as paired files):
SOL2511_S5_L001_R1_001.fastq
SOL2511_S5_L001_R2_001.fastq
SOL2510_S4_L001_R1_001.fastq
SOL2510_S4_L001_R2_001.fastq
...and so on...
I basically wish to automate the process by turning these into variables and passing those variables to each of the programs I use in turn. So, for example, my idea thus far was to assign them as wildcards, using the R1 and R2 (which appear in all the file names, as they represent each strand of DNA) as follows:
#!/bin/bash
seq1=*R1_001*
seq2=*R2_001*
On a rudimentary level this works, as it returns the correct files, so now I pass these variables to my first function which trims the DNA sequences down by a specified amount, like so:
# seqtk is the program suite, trimfq is a function within it,
# and the options -b -e specify how many bases to trim from the beginning and end of
# the DNA sequence respectively.
seqtk trimfq -b 10 -e 20 $seq1 >
seqtk trimfq -b 10 -e 20 $seq2 >
So now my problem is that I wish to append something like "_trim" to the output file name that appears after the >, but I can't find anything online that seems like it will work.
Alternatively, I've been hunting for a script that will take the name of the folder the files are in, and create a variable from that folder name which I can then give to the functions in question, so that all the output files are named correctly for use later on.
Many thanks in advance for any help, and I apologise that this isn't really much of a minimum working example to go on, as I'm only just getting going on all this stuff!
Joe
EDIT
So I modified @ghoti's for loop (it does the job wonderfully, I might add; rep for you :D) and now I prepend trim_, as the loop as it was before ended up giving me a .fastq.trim extension, which will cause errors later.
Is there any way I can append _trim to the end of the filename, but before the extension?
Explicit is usually better than implicit when matching filenames. Your wildcards may match more than you expect, especially if you have versions of the files with "_trim" appended to the end!
I would be more precise with the wildcards, and use for loops to process the files instead of relying on seqtk to handle multiple files. That way, you can do your own processing on the filenames.
Here's an example:
#!/bin/bash
# Define an array of sequences
sequences=(R1_001 R2_001)
# Step through the array...
for seq in "${sequences[@]}"; do
# Step through the files in this sequence...
for file in SOL*_${seq}.fastq; do
seqtk trimfq -b 10 -e 20 "$file" > "${file}.trim"
done
done
I don't know how your folders are set up, so I haven't addressed that in this script. But the basic idea is that if you want the script to be able to manipulate individual filenames, you need something like a for loop to handle that manipulation on a per-filename basis.
Does this help?
UPDATE:
To put _trim before the extension, replace the seqtk line with the following:
seqtk trimfq -b 10 -e 20 "$file" > "${file%.fastq}_trim.fastq"
This uses something documented in the Bash man page under Parameter Expansion if you want to read up on it. Basically, the ${file%.fastq} takes the $file variable and strips off a suffix. Then we add your extra text, along with the suffix.
You could also strip an extension using basename(1), but there's no need to call something external when you can use something built in to the shell.
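For instance (a quick sketch with a made-up filename):
file=SOL2511_S5_L001_R1_001.fastq
echo "${file%.fastq}_trim.fastq"              # parameter expansion, no external command
echo "$(basename "$file" .fastq)_trim.fastq"  # same result, but forks basename(1)
Both lines print SOL2511_S5_L001_R1_001_trim.fastq.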
Instead of setting variables with the filenames, you could pipe the output of ls to the command you want to run with these filenames, like this:
ls *R{1,2}_001* | xargs -I@ sh -c 'seqtk trimfq -b 10 -e 20 "$1" > "${1}_trim"' -- @
xargs -I@ takes each line of output from the previous command, substitutes it for @, and hands it to seqtk
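As a minimal illustration of the -I substitution on its own:
# prints "processing one", "processing two", "processing three"
printf '%s\n' one two three | xargs -I@ echo "processing @"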
I would like to write a script to execute the steps outlined below. If someone can provide simple examples of how to modify files and search through folders using a script (not necessarily solving my problem below), I will greatly appreciate it.
submit job MyJob in currentDirectory using myJobShellFile.sh to a queue
upon completion of MyJob, goto to currentDirectory/myJobDataFolder.
In myJobDataFolder, there are folders
myJobData.0000 myJobData.0001 myJobData.0002 myJobData.0003
I want to find the maximum number maxIteration of all the listed folders. Here it would be maxIteration=0003.
In the file myJobShellFile.sh, the last line says
mpiexec ./main input myJobDataFolder
I want to append this line to
'mpiexec ./main input myJobDataFolder 0003'
I want to submit MyJob to the queue while maxIteration < 10.
Upon completion of MyJob, find the new maxIteration, change this number in myJobShellFile.sh, and go to step 4.
I think people typically write Python scripts to do this kind of thing, but I'm having a hard time finding out how. I probably don't know the correct terminology for this procedure. I am also aware that the script will vary slightly depending on the queuing system, but any help will be greatly appreciated.
Quite a few aspects of your question are unclear, such as the meaning of “submit job MyJob in currentDirectory using myJobShellFile.sh to a queue” and “append this line to 'mpiexec ./main input myJobDataFolder 0003'”, how you detect when a job is done, the relevant parts of myJobShellFile.sh, and some other details. If you can list the specific shell commands you use in each iteration of job submission, then you can post a better question, with a bash tag instead of python.
In the following script, I put a ### at the end of any line where I am guessing what you are talking about. Lines ending with ### may be irrelevant to whatever you actually do, or may be pseudocode. Anyway, the general idea is that the script is supposed to do the things you listed in your items 1 to 5. This script assumes that you have modified myJobShellFile.sh to say
mpiexec ./main input $1 $2
instead of
mpiexec ./main input
because it is simpler to use parameters to modify what you tell mpiexec than it is to keep modifying a shell script. Also, it seems to me you would want to increment maxIter before submitting the next job, instead of after. If so, remove the # from the t=$((1$maxIter+1)); maxIter=${t#1} line. See the “Parameter Expansion” section of man bash regarding expansion of the ${var#txt} form, and the “Arithmetic Expansion” section regarding the $((expression)) form. The 1$maxIter and similar forms are used to change text like 0018 (which is not a valid bash number, because 8 is not an octal digit) into 10018.
#!/bin/bash
./myJobShellFile.sh MyJob ###
maxIter=0
while true; do
    waitforjobcompletion ###
    cd ./myJobDataFolder
    maxFile=$(ls myJobData* | tail -1)
    maxIter=${maxFile#myJobData.}   # Get max extension
    # If you want to increment maxIter, uncomment the next line
    # t=$((1$maxIter+1)); maxIter=${t#1}
    cd ..
    if [[ 1$maxIter -lt 11000 ]] ; then
        ./myJobShellFile.sh MyJobDataFolder $maxIter
    else
        break
    fi
done
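As a quick sanity check of the leading-1 trick used above (a sketch you can paste into an interactive shell):
maxIter=0018                 # plain $((maxIter+1)) would fail: 0018 looks like octal
t=$((1$maxIter + 1))         # 10018 + 1 = 10019
maxIter=${t#1}               # strip the leading 1 again
echo "$maxIter"              # prints 0019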
Notes: (1) To test with smaller runs than 1000 submissions, replace 11000 by 10000+n; for example, to do 123 runs, replace it with 10123. (2) In writing the above script, I assumed that not-previously-known numbers of output files appear in the output directory from time to time. If instead exactly one output file appears per run, and you just want to do one run per value for the values 0000, 0001, 0002, …, 0999, 1000, then use a script like the following. (For testing with a smaller number than 1000, replace 1000 with, e.g., 0020. The leading zeroes in these numbers tell bash to pad the generated numbers with leading zeroes.)
#!/bin/bash
for iter in {0000..1000}; do
./myJobShellFile.sh MyJobDataFolder $iter
waitforjobcompletion ###
done
(3) If the system has a command that sleeps while it waits for a job to complete on the supercomputing resource, it is reasonable to use that command in place of waitforjobcompletion in the above scripts. Otherwise, if the system has a command jobisrunning that returns true if a job is still running, replace waitforjobcompletion with something like the following:
while jobisrunning ; do sleep 15; done
This will run the jobisrunning command; if it returns true, the shell will sleep for 15 seconds and then retest. Here is an example that illustrates waiting for a file to appear and then for it to go away:
while [ ! -f abc ]; do sleep 3; echo no abc; done
while ls abc >/dev/null 2>&1; do sleep 3; echo an abc; done
The second line's test could be [ -f abc ] instead; I showed a longer example to illustrate how to suppress output and error messages by routing them to /dev/null. (4) To reverse the sense of a while statement's test, replace the word while with until. For example, while [ ! -f abc ]; ... is equivalent to until [ -f abc ]; ....
I have a file system containing PNG images. The layout of the filesystem is: ZOOM/X/Y.png where ZOOM, X, and Y are all integers.
I need to change the names of the PNG files. Basically, I need to convert Y from its current value to 2^ZOOM-Y-1. I've written a bash script to accomplish this task. However, I suspect it can be optimized substantially. (I also suspect that I may have been better off writing it in Perl, but that is another story.)
Here is the script. Is this about as good as it gets? Or can the performance be optimized? Are there tools I can use that would profile the script for me and tell me where I'm spending all my execution time?
#!/bin/bash
tiles=`ls -d */*/*`
for oldPath in $tiles
do
oldY=`basename -s .png $oldPath`
zoomX=`dirname $oldPath`
zoom=`echo $zoomX | sed 's#\([^\]\)/.*#\1#'`
newY=`echo 2^$zoom-$oldY-1|bc`
mv ${zoomX}/${oldY}.png ${zoomX}/${newY}.png
done
for oldpath in */*/*.png
do
    oldy=$(basename "$oldpath" .png)
    zoom_x=$(dirname "$oldpath")
    zoom=$(dirname "$zoom_x")
    newy=$(echo "2^$zoom - $oldy - 1" | bc)
    mv "$oldpath" "$zoom_x/$newy.png"
done
This avoids using sed. I like basename and dirname. However, you can also use bash (and Korn) shell notations such as:
zoom=${zoom_x%/*}
x=${zoom_x#*/}
You might be able to do it all without invoking basename or dirname at all.
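For example (a sketch with a made-up ZOOM/X value):
zoom_x=12/345
echo "${zoom_x%/*}"   # 12   (strip from the last '/': what dirname does)
echo "${zoom_x#*/}"   # 345  (strip through the first '/': what basename does)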
REWRITE due to misunderstanding of the formula and the updated var names. Still no subprocesses apart from mv and ls.
#!/bin/bash
tiles=`ls -d */*/*`
for thisPath in $tiles
do
thisFile=${thisPath#*/*/}
oldY=${thisFile%.png}
zoomX=${thisPath%/*}
zoom=${thisPath%/*/*}
newY=$(((1<<zoom) - oldY - 1))
mv ${zoomX}/${oldY}.png ${zoomX}/${newY}.png
done
It's likely that the overall throughput of your rename is limited by the filesystem. Choosing the right filesystem and tuning it for this sort of operation would speed up the overall job much more than tweaking the script.
If you optimize the script, you'll probably see less CPU consumed but the same total duration. Since forking off the various subprocesses (basename, dirname, sed, bc) probably costs more than the actual work, you are probably right that a Perl implementation would use less CPU, because it can do all of those operations internally (including the mv).
I see three improvements I would make if this were my script. Whether they have a huge impact, I don't think so.
But you should avoid parsing the output of ls like the plague. Maybe this directory is very predictable from the things found inside, but if I read your script correctly, you can use globbing with for directly:
for thisPath in */*/*
Second, $(cmd) is better than the deprecated backtick form `cmd`, which can't be nested:
thisDir=$(dirname "$thisPath")
Third, do arithmetic directly in bash:
newTile=$((2**$zoom-$thisTile-1))
as long as you don't need floating point and the numbers don't get too big.
I don't get the sed-part:
zoom=`echo $zoomX | sed 's#\([^\]\)/.*#\1#'`
Is there something missing after the backslash? A second one? You're searching for something which isn't a backslash, followed by a slash and the rest? Maybe it could be done purely in bash too.
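For the record, extracting the zoom level without sed could look like this (a sketch, assuming $zoomX holds ZOOM/X as in the original script):
zoomX=12/345
zoom=${zoomX%%/*}   # everything before the first slash
echo "$zoom"        # prints 12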
One precept of computing, credited to Donald Knuth, is "don't optimize too early." Scripts run pretty fast, and mv operations (as long as they're not going across filesystems, where you're really copying to another disk and then deleting the file) are pretty fast as well, as all the filesystem has to do in most cases is rename the file or change its parentage.
Probably where it's spending most of its time is in that initial ls operation. I suspect you have a LOT of files. There isn't much that can be done there. Doing it in another language like Perl or Python is going to face the same hurdle. However, you might be able to get more intelligence and not limit yourself to three levels (*/*/*).
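For instance, enumerating the files with find instead of a fixed-depth glob would handle any directory depth (a sketch; the rename logic from the loops above would go inside):
# visit every .png at any depth, not just ZOOM/X/Y.png
find . -name '*.png' | while read -r oldpath
do
    echo "would process $oldpath"   # compute the new name and mv here
done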
I have a list of objects in a Makefile variable called OBJECTS which is too big for the command buffer. Therefore I'm using the following method to create a file listing the objects (to pass to ar):
objects.lst:
$(foreach OBJ,$(OBJECTS),$(shell echo "$(OBJ)">>$@))
While this works, it is extremely slow (on Cygwin at least), and I don't like relying on shell commands and redirection.
Additionally, foreach is not intended for this purpose: it is evaluated before any commands are run, which means I can't, for example, rm -f objects.lst before appending.
Is there a better way? I don't want to use incremental archiving as that causes problems with multiple jobs.
The only thing I can think of is parsing the Makefile with a separate script to read the object list or storing the object list in a separate file. Both solutions have their own problems though.
Try something like:
OBJECTS:=a b c d
objects.lst:
echo > $@ <<EOF $(OBJECTS)
i.e. make use of the <<EOF functionality that is built into the shell. It does not have any max-length limitations.
In the following example I also replaced echo with a simple Perl script to split the arguments onto new lines, but this is the gist of it.
objects.lst:
echo $(wordlist 1,99,$(OBJECTS))>$@
echo $(wordlist 100,199,$(OBJECTS))>>$@
echo $(wordlist 200,299,$(OBJECTS))>>$@
echo $(wordlist 300,399,$(OBJECTS))>>$@
...
How about something like this:
OBJECTS_AM=$(filter a% b% c% d% e% f% g% h% i% j% k% l% m%,$(OBJECTS))
OBJECTS_NZ=$(filter-out a% b% c% d% e% f% g% h% i% j% k% l% m%,$(OBJECTS))
objects.lst:
$(shell echo "$(OBJECTS_AM)">$#)
$(shell echo "$(OBJECTS_NZ)">>$#)
You might need to split it one or two more times, but it's not that bad, especially as the distribution of file names doesn't change all that often.
Here's a patch to gnu make that lets you directly write a variable into a file.
It creates a new 'writefile' function, similar to the existing 'info' function, except it takes a filename argument and writes to the file:
https://savannah.gnu.org/bugs/?35384
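For what it's worth, GNU make 4.0 and later ship a built-in file function that covers this use case, so on a new enough make no patch is needed; a sketch:
objects.lst:
	$(file >$@,$(OBJECTS))
The $(file >$@,...) call expands to nothing and writes the whole variable to the target without ever building a shell command line, so the length limit never comes into play.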