Speed up bedpostx by parallelization

I am using an FSL tool called bedpostx, which fits a diffusion model to my (preprocessed) data. The problem is that this process has now been running for over 24 hours. I would like to speed it up by poor man's parallelization: running bedpostx_single_slice.sh in several terminals, each applied to a batch of slices. I keep getting errors, though. This is the command I launch in the terminal:
bedpostx_single_slice.sh Tirocinio/Dati_DTI/DTI_analysis_copy 37
where the first argument is the directory with my data and 37 is the index of the slice I want to analyze. This is the error that I get:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
Unfortunately there is not much documentation on this tool, and I am fairly new to programming.
In case it helps, here is the script of bedpostx_single_slice.sh:
#!/bin/sh
# Copyright (C) 2012 University of Oxford
export LC_ALL=C

subjdir=$1    # subject directory (first argument)
slice=$2      # slice number (second argument)
shift
shift
opts=$*       # any remaining arguments are passed straight through to xfibres

slicezp=`${FSLDIR}/bin/zeropad $slice 4`    # zero-pad the slice number to 4 digits

${FSLDIR}/bin/xfibres \
    --data=$subjdir/data_slice_$slicezp \
    --mask=$subjdir/nodif_brain_mask_slice_$slicezp \
    -b $subjdir/bvals -r $subjdir/bvecs \
    --forcedir --logdir=$subjdir.bedpostX/diff_slices/data_slice_$slicezp \
    $opts > $subjdir.bedpostX/logs/log$slicezp && echo Done && touch $subjdir.bedpostX/logs/monitor/$slice

BedpostX itself is pretty well parallelized by the FSL team these days, so you would be much better off taking advantage of that directly.
If you want a quick and easy way to parallelize, check out "Parallelizing FSL without the pain" from NeuroDebian.
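That said, if you do want to drive bedpostx_single_slice.sh by hand, it is easier to launch a few slices at a time from one script than to open many terminals. Below is a minimal sketch, not part of FSL: it assumes the per-slice files (data_slice_*, nodif_brain_mask_slice_*) and the $subjdir.bedpostX log directories that the quoted script expects already exist, and that you adjust n_slices to your data.
#!/bin/bash
# Poor man's parallelization: fit n_jobs slices at a time in the background,
# waiting for each batch to finish before starting the next.
subjdir=Tirocinio/Dati_DTI/DTI_analysis_copy
n_slices=70        # total number of slices (assumption; check your image dimensions)
n_jobs=4           # how many slices to fit concurrently

for ((slice=0; slice<n_slices; slice+=n_jobs)); do
    for ((j=slice; j<slice+n_jobs && j<n_slices; j++)); do
        bedpostx_single_slice.sh "$subjdir" "$j" &
    done
    wait    # block until this batch of background jobs has finished
done
Note that bedpostx_single_slice.sh reads $subjdir/data_slice_NNNN and writes under $subjdir.bedpostX, so running it against a directory that the main bedpostx script has not prepared will fail.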

Related

The 'pwd' command takes much longer after reading a big file (e.g. 40 MB)

The following test shell code was run on CentOS 7 with bash. The script has three phases: phase 1 calls the pwd command in a loop; phase 2 reads a big file (cats the file into a variable); phase 3 does the same thing as phase 1.
Phase 3 takes much longer than phase 1 (e.g. 21 s vs 7 s).
On macOS, however, phase 1 and phase 3 take the same time.
#!/bin/bash

# Phase 1: time 10000 command substitutions while the shell is still small
timeStart1=$(date +%s)
for ((ip=1;ip<=10000;ip++)); do
    nc_result=$(pwd)
done
timeEnd1=$(date +%s)
timeDelta=$((timeEnd1-timeStart1))
echo $timeDelta

# Phase 2: read a big file (e.g. a 39 MB file) into a shell variable
fileName='./content.txt'
content=`cat $fileName`

# Phase 3: repeat phase 1, now that the shell holds the big variable
timeStart2=$(date +%s)
for ((ip=1;ip<=10000;ip++)); do
    nc_result=$(pwd)
done
timeEnd2=$(date +%s)
timeDelta2=$((timeEnd2-timeStart2))
echo $timeDelta2
Slawomir's answer identifies a key part of the problem, but without the full explanation. The answer to What does it mean 'fork()' will copy address space of original process? on Unix & Linux Stack Exchange has some good background.
A command substitution -- $(...) -- is implemented by fork()ing off a separate copy of your shell, in which the command -- in this case pwd -- is executed.
Now, on most UNIX-like systems fork() is extremely efficient: it doesn't actually copy all your memory up front. Each copy keeps the same virtual memory ranges as the original (so its pointers remain valid), with the MMU configured to fault on writes, so the OS can silently catch that fault and allocate separate physical memory only for pages that actually change (copy-on-write).
There is still a per-fork cost, though: the kernel has to set up the page tables that mark every page copy-on-write, and that cost grows with how much memory the parent process has mapped -- which is exactly what reading a 39 MB file into a shell variable increases. Some platforms -- like Cygwin -- have much more expensive fork implementations; some (apparently macOS?) have faster ones; that difference is what you're measuring here.
Two takeaways:
It's not pwd that's slow, it's $( ). It'd be just as slow with $(true) or any other shell builtin, and considerably slower with any non-builtin command.
Don't use $(pwd) at all -- there's no reason to pay the cost of forking off a child process just to ask it for its working directory when the parent shell already knows its own: use nc_result=$PWD instead.
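For example, rewriting the timed loop from the question without the command substitution (a small sketch reusing the same pattern) removes the fork entirely, so the timing no longer depends on how much memory the shell holds:
# Same loop as phase 3, but reading the parent shell's own working directory: no subshell, no fork
timeStart3=$(date +%s)
for ((ip=1;ip<=10000;ip++)); do
    nc_result=$PWD
done
timeEnd3=$(date +%s)
echo $((timeEnd3-timeStart3))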
$() invokes a subshell; once phase 2 has read the large file into a shell variable, the shell that must be forked for every substitution in phase 3 is far bigger than it was in phase 1.

Creating steps in bash script

To start, I am relatively new to shell scripting. I was wondering if anyone could help me create "steps" within a bash script. For example, I'd like to run one analysis and then have the script proceed to the next analysis with the output files generated in the first analysis.
So for example, the script below will generate output file "filt_C2":
./sortmerna --ref ./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-id98.db:./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-id98.db:./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-id95.db:./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s-id98.db:./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-database-id98.db:./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s.db --reads ~/path/to/file/C2.fastq --aligned ~/path/to/file/rrna_C2 --num_alignments 1 --other ~/path/to/file/filt_C2 --fastx --log -a 8 -m 64000
Once this step is complete, I would like to run another step that will use the output file "filt_C2" that was generated. I have been creating multiple bash scripts for each step; however, it would be more efficient if I could do each step in one bash file. So, is there a way to make a script that will complete Step 1, then move to Step 2 using the files generated in step 1? Any tips would be greatly appreciated. Thank you!
Welcome to bash scripting!
Here are a few tips:
You can have multiple lines, as many as you like, in a bash script file.
You may call other bash scripts (or any other executable programs) from within your shell script, just as Frank has mentioned in his answer.
You may use variables to make your script more generic, say, if you want to name your result "C3" instead of "C2". (Not shown below)
You may use bash functions if your script becomes more complicated, e.g. see https://ryanstutorials.net/bash-scripting-tutorial/bash-functions.php
I recommend placing sortmerna in a directory that is on your PATH environment variable, and replacing the repeated ~/path/to/file with a variable (say WORKDIR) for consistency and flexibility.
For example, let’s say you name your script print_analysis.sh:
#!/bin/bash
# print_analysis.sh
# Written by Nikki E. Andrzejczyk, November 2018
# Set variables
WORKDIR=~/path/to/file
# Stage 1: Generate filt_C2 using SortMeRNA
./sortmerna --ref ./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-id98.db:./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-id98.db:./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-id95.db:./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s-id98.db:./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-database-id98.db:./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s.db \
--reads "$WORKDIR/C2.fastq" \
--aligned "$WORKDIR/rrna_C2" \
--num_alignments 1 \
--other "$WORKDIR/filt_C2" \
--fastx --log -a 8 -m 64000
# Stage 2: Process filt_C2 to generate result_C2
./stage2 "$WORKDIR/filt_C2" > "$WORKDIR/result_C2.txt"
# Stage 3: Print the result in result_C2
less "$WORKDIR/result_C2.txt"
Note how the trailing backslashes (\) let me split the long sortmerna command into multiple shorter lines, and how # starts a human-readable comment.
There is still room for improvement (as mentioned above but not implemented in this quick example), but I hope it shows how to expand your bash script so that it performs multiple steps in one go.
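One such improvement: by default bash keeps going even when a stage fails, so a later stage may run on missing or partial output. A small sketch of the usual guard (the sortmerna arguments are abbreviated here for readability; keep the full set from the script above, and ./stage2 is still the hypothetical second step):
#!/bin/bash
# Stop at the first failing command so Stage 2 never runs without filt_C2
set -euo pipefail

WORKDIR=~/path/to/file

./sortmerna --reads "$WORKDIR/C2.fastq" --other "$WORKDIR/filt_C2" --fastx   # Stage 1 (abbreviated)
./stage2 "$WORKDIR/filt_C2" > "$WORKDIR/result_C2.txt"                       # Stage 2: runs only if Stage 1 succeeded
less "$WORKDIR/result_C2.txt"                                                 # Stage 3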
Bash is actually a very powerful scripting and programming language. To learn more, you may want to start with Bash tutorials like the following:
https://ryanstutorials.net/bash-scripting-tutorial/
http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html
Hope this helps! If you have any other questions, or if I had misunderstood your question, please feel free to ask!
Cheers,
Anthony

How to process multiple files in sequence using OPEN MP and/or MPI?

I'm using the parallel_multicore version of the DBSCAN clustering algorithm available below:
http://cucis.ece.northwestern.edu/projects/Clustering/index.html
To run the code simply requires the following line:
./omp_dbscan -i trial.txt -m 4 -e 0.5 -o output.txt -t 8
where -i is the input, -m and -e are two parameters, -o is the output and -t is the number of threads.
What I want to do is adapt this command so that I can process lots of input files (say trial_1.txt, trial_2.txt, trial_3.txt and so on) sequentially, but I'm not really sure how to do this. Any help would be greatly appreciated as I'm thoroughly lost!
You don't need to touch the OpenMP/MPI code for this: any Unix server will have a shell installed.
Unix shells have been used for automating simple processes since the beginning of Unix; they are the scripting language at the very heart of any Unix system.
Their syntax is very easy to learn, and it will let you automate tasks like this easily. So get a tutorial on shell scripting!
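For instance, here is a minimal sketch that runs your command once per input file, one after the other. It assumes the inputs are named trial_1.txt, trial_2.txt, ... in the current directory; adjust the output naming to taste:
#!/bin/bash
# Run omp_dbscan sequentially on every trial_*.txt file, one output file per input
for f in trial_*.txt; do
    ./omp_dbscan -i "$f" -m 4 -e 0.5 -o "output_${f%.txt}.txt" -t 8
done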

Matlab - strange characters on output

I run my Matlab scripts from bash in the following way:
matlab -nodesktop -nosplash -nodisplay -r "matlabfun()" &> log
The resulting log file starts and ends with a strange character sequence that in less appears as: ESC[?1hESC=. Do you know what this is caused by?
I can reproduce your error. From this table I would assume that MATLAB switches the terminal into application cursor-key mode (ESC[?1h) and application keypad mode (ESC=).
I have no idea where else that would matter in a bash session; maybe it is a leftover from the graphical version or from other platforms. You can just ignore it.
I haven't had the chance to verify it myself, but this website suggests that it happens because bash tries to help you.
The suggested workaround is to set TERM to an invalid entry:
TERM=vt444
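Combined with the command from the question, that would look like the line below (a sketch: with a TERM value the terminal database does not know, MATLAB should not emit those mode-switching escape sequences into the log):
# Run MATLAB with an unknown TERM so no terminal-mode escape codes end up in the log file
TERM=vt444 matlab -nodesktop -nosplash -nodisplay -r "matlabfun()" &> log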

qsub for one machine?

A frequent problem I encounter is having to run some script with 50 or so different parameterizations. In the old days, I'd write something like (e.g.)
for i in `seq 1 50`
do
./myscript $i
done
In the modern era, though, all my machines can handle 4 or 8 threads at once. The scripts aren't multithreaded, so what I want to be able to do is run 4 or 8 parameterizations at a time, and to automatically start new jobs as the old ones finish. I can rig up a haphazard system myself (and have in the past), but I suspect that there must be a Linux utility that does this already. Any suggestions?
GNU parallel does this, and by default it runs one job per CPU core, starting a new job whenever one finishes. With it, your example becomes:
parallel ./myscript ::: $(seq 1 50)
or, feeding the arguments on stdin:
seq 1 50 | parallel ./myscript
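If GNU parallel is not installed, a rough equivalent using only standard tools is xargs with -P, which also keeps a fixed number of jobs running and starts a new one whenever one finishes:
# Run ./myscript for the arguments 1..50, at most 8 at a time, one argument per invocation
seq 1 50 | xargs -n 1 -P 8 ./myscript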
