I am trying to use the vcftools package to calculate Weir and Cockerham's FST. I would like to loop over two pairs of populations in the first instance and then run each pair across all variants from the 1000 Genomes project, where each chromosome is stored in a separate VCF file. For example, for pop1 vs pop2 and for pop3 vs pop4, calculate FST for chromosomes 1-10. Each population file, for example LWKfile, contains a list of the individuals that belong to that population.
I have attempted:
for population in LWK_GBR YRI_FIN; do
    firstpop=$(echo $population | cut -d '_' -f1)
    secondpop=$(echo $population | cut -d '_' -f2)
    for filename in *.vcf.gz; do
        vcftools --gzvcf ${filename} \
            --weir-fst-pop /outdir/${firstpop}file \
            --weir-fst-pop /outdir/${secondpop}file \
            --out /out/${population}_${filename}
    done
done
However, this does not loop through all the files and seems to get stuck on chromosome 10. Is there a more efficient way to do this in bash? I am concerned the nested loop will be too slow.
Are you sure it is the for filename in *.vcf.gz loop that is too slow to get through all the files?
Try putting an echo before vcftools to see whether the loop itself gets stuck or not.
You need to be sure about what is actually taking the time in order to make the right choice.
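A quick way to check is to instrument the question's loop with an echo and the time builtin, for example:
for population in LWK_GBR YRI_FIN; do
    firstpop=$(echo "$population" | cut -d '_' -f1)
    secondpop=$(echo "$population" | cut -d '_' -f2)
    for filename in *.vcf.gz; do
        echo "starting ${population} on ${filename}"    # confirms the loop actually reaches this file
        time vcftools --gzvcf "${filename}" \
            --weir-fst-pop "/outdir/${firstpop}file" \
            --weir-fst-pop "/outdir/${secondpop}file" \
            --out "/out/${population}_${filename}"      # time reports how long each vcftools run takes
    done
done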
If it turns out vcftools is the slow part, you may not need to wait for each command to finish before starting the next one; consider running them asynchronously.
If there are many files to get through, you should also consider processing them in parallel.
Also, you seem to loop over all the .vcf.gz files once per population pair; reversing your two loops will probably be slightly quicker.
Here is an example of parallel, asynchronous processing in bash:
#!/bin/bash

MAX_PARALLEL_PIDS=4   # adjust to your own machine's capacity (CPUs available, etc.; it could be calculated dynamically)

declare -a POPS
declare -a PIDS

POPS=("LWK_GBR" "YRI_FIN")

# your heavy work in a function
process() {
    pop="${1}"
    filename="${2}"
    firstpop="${pop%%_*}"    # no need to call an external program here
    secondpop="${pop#*_}"    # same here

    vcftools --gzvcf "${filename}" \
        --weir-fst-pop "/outdir/${firstpop}file" \
        --weir-fst-pop "/outdir/${secondpop}file" \
        --out "/out/${pop}_${filename}"
}

# a helper that waits for all processes once your "thread pool" reaches its limit
wait_for_pids() {
    for pid in "${PIDS[@]}"; do
        [[ $pid =~ ^[0-9]+ ]] && wait $pid
    done
    unset PIDS
}

i=0
for filename in *.vcf.gz; do
    if [[ $i -ge $MAX_PARALLEL_PIDS ]]; then
        i=0
        wait_for_pids
    fi

    for population in "${POPS[@]}"; do
        process "${population}" "${filename}" &   # don't wait for it to finish here
        PIDS[$i]=$!
        (( i++ ))
    done
done

# at the end, wait for the remaining processes
wait_for_pids
N.B.: Leaving aside variables inside [[ ]] conditions, pay attention to quoting any variable that can contain spaces, especially file names; the script will break otherwise.
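For instance (a contrived illustration, not taken from the scripts above), an unquoted expansion splits on the space:
filename="chr 10.vcf.gz"      # hypothetical name containing a space
ls -l $filename               # word-splits into two arguments: 'chr' and '10.vcf.gz'
ls -l "$filename"             # passes the name as a single argument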
When I do this with awk it's relatively fast, even though it's Row By Agonizing Row (RBAR). I tried to make a quicker, more elegant, bug-resistant solution in Bash that would only have to make far fewer passes through the file. It takes probably 10 seconds to get through the first 1,000 lines with bash using this code, while I can make 25 passes through all million lines of the file with awk in about the same time. How come bash is several orders of magnitude slower?
while read line
do
    FIELD_1=`echo "$line" | cut -f1`
    FIELD_2=`echo "$line" | cut -f2`
    if [ "$MAIN_REF" == "$FIELD_1" ]; then
        #echo "$line"
        if [ "$FIELD_2" == "$REF_1" ]; then
            ((REF_1_COUNT++))
        fi
        ((LINE_COUNT++))
        if [ "$LINE_COUNT" == "1000" ]; then
            echo $LINE_COUNT;
        fi
    fi
done < temp/refmatch
Bash is slow. That's just the way it is; it's designed to oversee the execution of specific tools, and it was never optimized for performance.
All the same, you can make it less slow by avoiding obvious inefficiencies. For example, read will split its input into separate words, so it would be both faster and clearer to write:
while read -r field1 field2 rest; do
    # Do something with field1 and field2
instead of
while read line
do
    FIELD_1=`echo "$line" | cut -f1`
    FIELD_2=`echo "$line" | cut -f2`
Your version sets up two pipelines and creates four children (at least) for every line of input, whereas using read the way it was designed requires no external processes whatsoever.
If you are using cut because your lines are tab-separated and not just whitespace-separated, you can achieve the same effect with read by setting IFS locally:
while IFS=$'\t' read -r field1 field2 rest; do
    # Do something with field1 and field2
Even so, don't expect it to be fast. It will just be less agonizingly slow. You would be better off fixing your awk script so that it doesn't require multiple passes. (If you can do that with bash, it can be done with awk and probably with less code.)
Note: I set three variables rather than two, because read puts the rest of the line into the last variable. If there are only two fields, no harm is done; setting a variable to an empty string is something bash can do reasonably rapidly.
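If a single awk pass is an option, the counting done by the script in the question might look something like this (a sketch, assuming the two counters are all that is ultimately needed; MAIN_REF and REF_1 are the shell variables from the original script):
awk -F'\t' -v main="$MAIN_REF" -v ref="$REF_1" '
    $1 == main {
        line_count++                      # lines whose first field matches MAIN_REF
        if ($2 == ref) ref_1_count++      # of those, lines whose second field also matches REF_1
    }
    END { print line_count + 0, ref_1_count + 0 }
' temp/refmatch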
As @codeforester points out, the original bash script spawns far too many subprocesses.
Here's a modified version that minimizes that overhead:
#!/bin/bash

while IFS=$'\t' read -r FIELD_1 FIELD_2 others; do
    if [[ "$MAIN_REF" == "$FIELD_1" ]]; then
        if [[ "$FIELD_2" == "$REF_1" ]]; then
            let REF_1_COUNT++
        fi
        let LINE_COUNT++
        if [[ "$LINE_COUNT" == "1000" ]]; then
            echo "$LINE_COUNT"
        fi
    fi
done < temp/refmatch
It runs more than 20 times faster than the original, but I'm afraid that may be close to the limit of what a bash script can do.
I want to rename multiple individual entries in a long file based on a comma delimited table. I figured out a way how to do it, but I feel it's highly inefficient and I'm wondering if there's a better way to do it.
My file contains >30k entries like this:
>Gene.1::Fmerg_contig0.1::g.1::m.1 Gene.1::Fmerg_contig0.1::g.1
TPAPHKMQEPTTPFTPGGTPKPVFTKTLKGDVVEPGDGVTFVCEVAHPAAYFITWLKDSK
>Gene.17::Fmerg_Transcript_1::g.17::m.17 Gene.17::Fmerg_Transcript_1::g.17
PLDDKLADRVQQTDAGAKHALKMTDEGCKHTLQVLNCRVEDSGIYTAKATDENGVWSTCS
>Gene.15::Fmerg_Transcript_1::g.15::m.15 Gene.15::Fmerg_Transcript_1::g.15
AQLLVQELTEEERARRIAEKSPFFMVRMKPTQVIENTNLSYTIHVKGDPMPNVTFFKDDK
And the table with the renaming information looks like this:
original,renamed
Fmerg_contig0.1,Fmerg_Transcript_0
Fmerg_contig1.1,Fmerg_Transcript_1
Fmerg_contig2.1,Fmerg_Transcript_2
The inefficient solution I came up with looks like this:
#!/bin/bash
# script to revert dammit name changes
while read line; do
    IFS="," read -r -a contig <<< "$line"
    sed -i "s|${contig[1]}|${contig[0]}|g" Fmerg_final.fasta.transdecoder_test.pep
done < Fmerg_final.fasta.dammit.namemap.csv
However, this means that sed makes a full pass over the large FASTA file once for every entry in the renaming table.
I imagine there is a way to touch each line only once and iterate over the name list for that line instead, but I'm not sure how to tackle this. I chose bash because it is the language I'm most fluent in, but I'm not averse to using perl or python if they offer an easier solution.
This is an O(n) problem and you solved it with an O(n) solution, so I wouldn't consider it inefficient. However, if you are good with bash you can do more with it, no problem.
Divide and conquer.
I have done this many times; you can bring the total run time down closer to the time it takes to process a single part.
Take this sketch: one function cuts the 30k-entry file into, say, X parts, and a loop then launches the renaming for each part with & so the parts run as background jobs.
declare -a file_part_names

# cut the big FASTA file into parts
cut_file_into_parts() {
    orig_file="$1"
    number_parts="$2"
    lines_per_part=$(( ($(wc -l < "$orig_file") / number_parts) + 1 ))
    split -l "$lines_per_part" "$orig_file" /tmp/file_part_
    file_part_names=( /tmp/file_part_* )
}

# apply every rename from the table to one file part
rename_fields_in_file() {
    file_part="$1"
    while read -r line; do
        IFS="," read -r -a contig <<< "$line"
        sed -i "s|${contig[1]}|${contig[0]}|g" "$file_part"
    done < Fmerg_final.fasta.dammit.namemap.csv
}

# main
cut_file_into_parts "Fmerg_final.fasta.transdecoder_test.pep" 500
for file_part in "${file_part_names[@]}"; do
    # keep the pipe full: at most 100 background jobs at a time
    while (( $(jobs -rp | wc -l) >= 100 )); do
        sleep 10
    done
    rename_fields_in_file "$file_part" &
done
wait

# now that you have a pile of processed temp files, combine them all
cat /tmp/file_part_* > final_result.txt
In summary: cut the big file into, say, 500 temporary files labelled file1, file2, etc. in, say, a /tmp folder. Then go through them one at a time, launching each as a background child process with up to, say, 100 running at the same time; keep the pipe full by checking the count, and if 100 are already running do nothing (sleep 10), otherwise add more. When they are done, one more loop combines file1_finish.txt, file2_finish.txt, etc., which is super quick.
NOTE: if this is too much, you can always just break the file up and call the same script X times, once per part, instead of using background jobs.
I have a set of scripts, e.g.
01_some_stuff1
02_some_stuff2
03_some_stuff3a
03_some_stuff3b
04_some_stuff4a
04_some_stuff4b
These scripts should run ordered by their number and scripts with the same number should run in parallel.
My first idea was to iterate over the possible numbers:
for n in $(seq -f %02.0f 0 99); do
    for s in "${n}_*"; do
        export CURRENT_JOB="${s}"
        "${s}" &
    done
    wait
done
Is this a safe method? Is there a more elegant solution that also allows to set a different environment for the inner loop elements?
You could use GNU Parallel like this:
#!/bin/bash

# Don't barf if no matching files when globbing
shopt -s nullglob

for n in $(printf "%02d " {1..4}); do
    # Get list (array) of matching scripts
    scripts=( ${n}_* )
    if [ ${#scripts[@]} -gt 0 ]; then
        parallel --dry-run -k 'CURRENT_JOB={} ./{}' ::: "${scripts[@]}"
    fi
    echo barrier
done
Sample Output
CURRENT_JOB=01_some_stuff1 ./01_some_stuff1
barrier
CURRENT_JOB=02_some_stuff2 ./02_some_stuff2
barrier
CURRENT_JOB=03_some_stuff3a ./03_some_stuff3a
CURRENT_JOB=03_some_stuff3b ./03_some_stuff3b
CURRENT_JOB=03_some_stuff3c ./03_some_stuff3c
barrier
CURRENT_JOB=04_some_stuff4a ./04_some_stuff4a
CURRENT_JOB=04_some_stuff4b ./04_some_stuff4b
barrier
Remove the echo barrier and --dry-run to actually run it properly.
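For reference, with those removed the loop would look roughly like this (GNU Parallel only returns once the whole group has finished, so the numeric groups still run strictly in order):
#!/bin/bash
shopt -s nullglob
for n in $(printf "%02d " {1..4}); do
    scripts=( ${n}_* )
    if [ ${#scripts[@]} -gt 0 ]; then
        # each group must finish before the next number starts
        parallel -k 'CURRENT_JOB={} ./{}' ::: "${scripts[@]}"
    fi
done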
The only real change you need is to avoid quoting the * in your pattern. If you are using bash 4.0 or later, you can use brace expansion to eliminate the dependency on seq.
# for n in $(seq -f %02.0f 0 99); do
for n in {00..99}; do
    for s in "${n}"_*; do
        export CURRENT_JOB="${s}"
        "${s}" &
    done
    wait
done
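One caveat (borrowing the nullglob idea from the GNU Parallel answer above): if some number has no matching scripts, the now-unquoted glob stays literal and the loop would try to execute a non-existent name such as 05_*. A hedged variant guards against that:
shopt -s nullglob   # an empty match expands to nothing instead of the literal pattern
for n in {00..99}; do
    for s in "${n}"_*; do
        export CURRENT_JOB="${s}"
        "${s}" &
    done
    wait
done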
I have a loop where I'm making incremental modifications to a large file. Rather than write to disk each time, I thought I'd use named pipes. However, this means that I'll need a unique name for each iteration of the loop, since I can't seem to redirect output back into the same named pipe.
$ mkfifo fifotemp
$ echo qwerty > fifotemp &
$ grep qwe <fifotemp >fifotemp &
$ cat <fifotemp
[hangs]
I could create a new named pipe for each iteration, but this seemed inelegant. Is there a better way?
Potentially you could use plain pipes and recursive functions. You would need to pass everything into the recursive function to determine when to quit and what processing is needed at each recursion level. This example just adds the recursion level at the front of each line for each level, and quits at level 4:
#!/bin/bash

doEdits() {
    while read -r line
    do
        echo "$1 $line"
    done
}

doRecursion() {
    level=$(($1 + 1))
    #echo "doRecursion $level" 1>&2
    if [ $level -lt 4 ]
    then
        doEdits $level | doRecursion $level
    else
        # Just output all the input
        cat
    fi
}

doRecursion 0 < myInputFile > myOutputFile
I assume the number of recursion levels is fairly limited, otherwise you could run into system limitations on the number of open processes and pipes.
One advantage here is that each pipe should only need a small buffer. This could also be fast if your machine has multiple processors.
I'm evaluating if GNU Parallel can be used to search files stored on a system in parallel. There can be only one file for each day of year (doy) on the system (so a maximum of 366 files per year). Let's say there are 3660 files on the system (about 10 years worth of data). The system could be a multi-CPU multi-core Linux or a multi-CPU Solaris.
I'm storing the search commands to run on the files in an array (one command per file). This is what I'm doing right now (using bash), but I have no control over how many searches start in parallel (I definitely don't want to start all 3660 searches at once):
#!/usr/bin/env bash

declare -a cmds
declare -i cmd_ctr=0

while [[ <condition> ]]; do
    if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
        cmds[$cmd_ctr]="<cmd_to_run>"
        let cmd_ctr++
    fi
done

declare -i arr_len=${#cmds[@]}
for (( i=0; i<${arr_len}; i++ )); do
    # Get the command and run it in the background
    eval ${cmds[$i]} &
done
wait
If I were to use parallel (which will automatically figure out the max. CPUs/cores and start only so many searches in parallel), how can I reuse the array cmds with parallel and rewrite the above code? The other alternative is to write all commands to a file and then do cat cmd_file | parallel
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Using-shell-variables says:
parallel echo ::: "${V[@]}"
You do not want the echo, so:
parallel ::: "${cmds[@]}"
If you do not need $cmds for anything else, then use 'sem' (which is an alias for parallel --semaphore) https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Working-as-mutex-and-counting-semaphore
while [[ <condition> ]]; do
    if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
        sem -j+0 <cmd_to_run>
    fi
done
sem --wait
You have not described what <condition> might be. If you are simply doing something like a for-loop, you could replace the whole script with:
parallel 'if [ -s {} ] ; then cmd_to_run {}; fi' ::: $cur_archive_path/log.{1..3660}
(based on https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Composed-commands).