I guess this question can be tackled by answering one (or both) of the problems below:
1) How do I show jobs remaining?
2) How do I make the output of --eta prettier?
1) I've checked the man page, and I am presently using $PARALLEL_SEQ in my function, but how can I get the number of jobs remaining? Parallel is helping me compile about 800 files, and I'd like to know how many jobs remain.
2) Alternatively, is there a better (nicer) way to output --eta? My output looks really messy. I would only like to see a single ETA.
The parallel flags I am using: --no-notice --keep-order --group
Example of output:
819: Compiling form: USER_Q ok
ETA: 8s 13left 0.61avg local:4/819/100%/0.6s
820: Compiling form: USER_RESERVE_STOCK ok
ETA: 7s 12left 0.61avg local:4/820/100%/0.6s
821: Compiling form: USERS_AUTO ok
ETA: 7s 11left 0.61avg local:4/821/100%/0.6s
822: Compiling form: USERS ok
ETA: 6s 10left 0.61avg local:4/822/100%/0.6s
823: Compiling form: USERS_MENU ok
ETA: 6s 9left 0.61avg local:4/823/100%/0.6s
824: Compiling form: USER_SUPP ok
ETA: 4s 8left 0.61avg local:4/824/100%/0.6s
825: Compiling form: VARIANCE_L ok
ETA: 4s 7left 0.61avg local:4/825/100%/0.6s
826: Compiling form: VAR_L ok
ETA: 3s 6left 0.61avg local:4/826/100%/0.6s
827: Compiling form: WASTE ok
ETA: 3s 5left 0.61avg local:4/827/100%/0.6s
828: Compiling form: WASTE_L ok
ETA: 2s 2left 0.61avg local:2/830/100%/0.6s
829: Compiling form: WEATHER ok
You will need to redirect the output of your jobs to have --eta look nice:
seq 10 | parallel --eta 'echo foo; sleep .{}' >/dev/null
That will also show the number of jobs left. You may also find --bar useful:
seq 10 | parallel --bar 'echo foo; sleep .{}' >/dev/null
Or a more advanced example:
seq 1000 |
  parallel -j30 --bar '(echo {};sleep 0.3)' \
    2> >(perl -pe 'BEGIN{$/="\r";$|=1};s/\r/\n/g' |
         zenity --progress --auto-kill) | wc
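Applied to the compile workload in the question, a minimal sketch could look like the following (the function name compile_form, the file list forms.list and the total of 828 forms are assumptions, not taken from the question):
compile_form() {
    # $PARALLEL_SEQ is set by GNU parallel; with a known total (assumed 828 here)
    # you can derive the number of jobs remaining yourself:
    echo "$PARALLEL_SEQ: Compiling form: $1 ($(( 828 - PARALLEL_SEQ )) left)"
    # ... the real compile command would go here ...
}
export -f compile_form
# Redirect the jobs' own output to a log so only one updating --eta (or --bar) line stays on screen:
parallel --no-notice --keep-order --eta compile_form {} < forms.list > compile.log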
Related
I have a list of configuration files:
cfg1.cfg
cfg2.cfg
cfg3.cfg
cfg4.cfg
cfg5.cfg
cfg6.cfg
cfg7.cfg
...
that serve as input for two scripts:
script1.sh
script2.sh
which I run sequentially as follows:
script1.sh cfgX.cfg && script2.sh cfgX.cfg
where X=1, 2, 3, ...
These scripts are not parallelised and take a long time to run. How can I launch them in parallel, let's say 4 at a time, so I do not kill the server where I run them?
For just one script I tried a brute force approach similar to:
export COUNTER_LIMIT=4
export COUNTER=1
for each in $(ls *.cfg)
do
    INSTRUCTION="./script1.sh $each "
    if (($COUNTER >= $COUNTER_LIMIT)) ;
    then
        $INSTRUCTION &&
        export COUNTER=$(($COUNTER-$COUNTER_LIMIT));
        echo
        sleep 600s
    else
        $INSTRUCTION &
        sleep 5s
    fi
    echo $COUNTER
    export COUNTER=$(($COUNTER+1));
done
(the sleeps are because for some reason the scripts cannot be initiated at the same time...)
So, how can I ensure that the double ampersands in
script1.sh cfgX.cfg && script2.sh cfgX.cfg
don't block the brute-force parallelisation?
I also accept better and simpler approaches ;)
Cheers
jorge
UPDATE
I should have mentioned that the config files are not necessarily sequentially named and can have any name, I just made them like this to make the example as simple as possible.
parallel --jobs 4 \
         --load 50% \
         --bar \
         --eta "( echo 1st-for-{}; echo 2nd-for-{} )" < aListOfAdHocArguments.txt
0% 0:5=0s
1st-for-Abraca
2nd-for-Abraca
20% 1:4=0s
1st-for-Dabra
2nd-for-Dabra
40% 2:3=0s
1st-for-Hergot
2nd-for-Hergot
60% 3:2=0s
1st-for-Fagot
2nd-for-Fagot
80% 4:1=0s
100% 5:0=0s
Q : How can I launch them in parallel, let's say 4 at a time, so I do not kill the server where I run them?
A lovely task for GNU parallel.
First let's check the localhost ecosystem (executing parallel jobs over ssh-connected remote hosts is also possible, yet exceeds the scope of this post):
parallel --number-of-cpus
parallel --number-of-cores
parallel --show-limits
For more configuration details beyond --jobs 4 (potentially --memfree or --noswap, --load <max-load> or --keep-order, and --results <aFile> or --output-as-files), see:
man parallel
parallel --jobs 4 \
         --bar \
         --eta "( script1.sh cfg{}.cfg; script2.sh cfg{}.cfg )" ::: {1..123}
Here, emulated by just a pair of tandem echos for down-counted indexes, so the progress bars flash by and the Estimated-Time-of-Arrival (--eta) indications are almost instant:
parallel --jobs 4 \
         --load 50% \
         --bar \
         --eta "( echo 1st-for-cfg-{}; echo 2nd-for-cfg-{} )" ::: {10..0}
0% 0:11=0s 7
1st-for-cfg-10
2nd-for-cfg-10
9% 1:10=0s 6
1st-for-cfg-9
2nd-for-cfg-9
18% 2:9=0s 5
1st-for-cfg-8
2nd-for-cfg-8
27% 3:8=0s 4
1st-for-cfg-7
2nd-for-cfg-7
36% 4:7=0s 3
1st-for-cfg-6
2nd-for-cfg-6
45% 5:6=0s 2
1st-for-cfg-5
2nd-for-cfg-5
54% 6:5=0s 1
1st-for-cfg-4
2nd-for-cfg-4
63% 7:4=0s 0
1st-for-cfg-3
2nd-for-cfg-3
72% 8:3=0s 0
1st-for-cfg-2
2nd-for-cfg-2
81% 9:2=0s 0
1st-for-cfg-1
2nd-for-cfg-1
90% 10:1=0s 0
1st-for-cfg-0
2nd-for-cfg-0
Update
You added:
I should have mentioned that the config files are not necessarily sequentially named and can have any name, I just made them like this to make the example as simple as possible.
The < list_of_arguments form handles this ex-post change to the problem definition:
parallel [options] [command [arguments]] < list_of_arguments
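For instance, a hedged sketch that feeds arbitrarily named config files on stdin (the *.cfg glob is an assumption carried over from the question, and && keeps script2.sh conditional on script1.sh succeeding):
# The *.cfg glob is an assumption; any command producing one filename per line works:
ls *.cfg | parallel --jobs 4 --bar "( script1.sh {} && script2.sh {} )"
# A more robust variant for unusual filenames:
find . -maxdepth 1 -name '*.cfg' -print0 | parallel -0 --jobs 4 --bar "( script1.sh {} && script2.sh {} )"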
This would be fairly simple with find and xargs. The following runs four processes in parallel and, for any given config file, completes script1.sh before running script2.sh:
find . -name '*.cfg' -print0 | xargs -0 -P 4 -iCFG sh -c 'script1.sh CFG && script2.sh CFG'
I did a simulation test. First I created the file like you describe:
printf '%s\n' cfg{1..100}.cfg > file.txt
Now the script to process it.
#!/bin/bash
file=file.txt
limit=2
array=()
while read -r cfg; do
    array+=("$cfg")
done < "$file"

# Each of the $limit background loops handles every $limit-th config file.
for ((n=0; n<limit; n++)); do
    for ((i=n; i<${#array[@]}; i+=limit)); do
        echo script1.sh "${array[i]}" && echo script2.sh "${array[i]}" && sleep 2; echo
    done &
done
wait
Now if you run that script you should see what's going to happen. The echo and sleep are there just as a visual aid :-); you can remove them if you decide to actually run the scripts. Change the value of limit to your heart's content. The idea and technique for solving that particular problem did not come from me; it came from this guy: https://github.com/e36freak/. Give credit where it is due...
I am new to using Snakemake. I want to run my fastq files, which are in one folder, through NanoFilt. I want to run this with Snakemake because I need it to create a pipeline. This is my Snakemake script:
rule NanoFilt:
    input:
        "data/samples/{sample}.fastq"
    output:
        "nanofilt_out.gz"
    shell:
        "gunzip -c {input} | NanoFilt -q 8 | gzip > {output}"
When I run it I get the following error message:
WildcardError in line 2:
Wildcards in input files cannot be determined from output files:
'sample'
EDIT
I tried searching the error message but still couldn't figure out how to make it work. Can anyone help me?
So I tried what people on here suggested, and the new script is this:
samples = ['fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8293','fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8292','fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8291','fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8290']
rule all:
    input:
        [f"nanofilt_out_{sample}.gz" for sample in samples]

rule NanoFilt:
    input:
        "zipped/zipped{sample}.gz"
    output:
        "nanofilt_out_{sample}.gz"
    shell:
        "gunzip -c {input} | NanoFilt -q 8 | gzip > {output}"
but when I run this I get the following error message:
Error in rule NanoFilt:
Removing output files of failed job NanoFilt since they might be corrupted:
nanofilt_out_fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8292.gz
jobid: 4
output: nanofilt_out_fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8290.gz
shell:
gunzip -c zipped/zippedfastqrunid4d89b52e7b9734bd797205037ef201a30be415c8290.gz | NanoFilt -q 8 | gzip > nanofilt_out_fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8290.gz
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Does anyone know how to fix this?
The 'idea' of Snakemake is that you specify what output you want (through, for instance, rule all), and Snakemake takes a look at all the rules defined and works out how to get to the desired output.
When you tell Snakemake you want nanofilt_out.gz as output, how does it know which sample to take? It doesn't. If it just took any of the possible sample files, we would lose the information about which sample the output belongs to. To solve this, the output needs to contain the same wildcard as the input:
rule NanoFilt:
    input:
        "data/samples/{sample}.fastq"
    output:
        "nanofilt_out_{sample}.gz"
    shell:
        "gunzip -c {input} | NanoFilt -q 8 | gzip > {output}"
This way Snakemake can make an output for every sample. You do still need to adjust the pipeline so that you specify which output you want, maybe something like this:
samples = [1,2,3]
rule all:
    input:
        [f"nanofilt_out_{sample}.gz" for sample in samples]
I made it work. The working code looks like this.
rule NanoFilt:
    input:
        expand("zipped/zipped.gz", sample=samples)
    output:
        "filtered/nanofilt_out.gz"
    conda:
        "envs/nanoFilt.yaml"
    shell:
        "gunzip -c {input} | NanoFilt -q 6 | gzip > {output}"
I have around 135,000 .TIF files (1.2KB to 1.4KB) sitting on my hard drive. I need to extract text out of those files. If I run tesseract as a cron job I get 500 to 600 per hour at the most. Can anyone suggest strategies so I can get at least 500 per minute?
UPDATE:
Below is my code after implementing the suggestions given by @Mark; I still don't seem to get beyond 20 files per minute.
#!/bin/bash
cd /mnt/ramdisk/input
function tess()
{
    if [ -f /mnt/ramdisk/output/$2.txt ]
    then
        echo skipping $2
        return
    fi
    tesseract --tessdata-dir /mnt/ramdisk/tessdata -l eng+kan $1 /mnt/ramdisk/output/$2 > /dev/null 2>&1
}
export -f tess
find . -name \*.tif -print0 | parallel -0 -j100 --progress tess {/} {/.}
You need GNU Parallel. Here I process 500 TIF files of 3kB each in 37s on an iMac. By way of comparison, the same processing takes 160s if done in a sequential for loop.
The basic command looks like this:
parallel --bar 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif
which will show a progress bar and use all available cores on your machine.
If you want to see what it would do without actually doing anything, use parallel --dry-run.
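For example, to preview the generated commands without running tesseract at all:
parallel --dry-run 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif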
As you have 135,000 files it will probably overflow your command line length - you can check with sysctl like this:
sysctl -a kern.argmax
kern.argmax: 262144
So you need to pump the filenames into GNU Parallel on its stdin and separate them with null characters so you don't get problems with spaces:
find . -iname \*.tif -print0 | parallel -0 --bar 'tesseract {} {.} > /dev/null 2>&1'
If you are dealing with very large numbers of files, you probably need to consider the possibility of being interrupted and restarted. You could either mv each TIF file after processing to a subdirectory called processed so that it won't get done again on restarting, or you could test for the existence of the corresponding txt file before processing any TIF like this:
#!/bin/bash
doit() {
    if [ -f "${2}.txt" ]; then
        echo Skipping $1...
        return
    fi
    tesseract "$1" "$2" > /dev/null 2>&1
}
export -f doit
time parallel --bar doit {} {.} ::: *.tif
If you run that twice in a row, you will see it is near instantaneous the second time because all the processing was done the first time.
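Here is a hedged sketch of the other restart strategy mentioned above, moving each TIF into the processed subdirectory once it has been handled (only the directory name comes from the text; the rest is illustrative):
#!/bin/bash
mkdir -p processed
doit() {
    # Only move the TIF away if tesseract succeeded, so failures are retried on the next run:
    tesseract "$1" "$2" > /dev/null 2>&1 && mv "$1" processed/
}
export -f doit
parallel --bar doit {} {.} ::: *.tif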
If you have millions of files, you could consider using multiple machines in parallel, so just make sure you have ssh logins to each of the machines on your network and then run across 4 machines, including the localhost like this:
parallel -S :,remote1,remote2,remote3 ...
where : is shorthand for the machine on which you are running.
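If the TIF files exist only on the local machine, they also need to be transferred to the remote hosts and the resulting text files brought back; a hedged sketch using GNU Parallel's transfer/return/cleanup shorthand (--trc), with placeholder hostnames:
# remote1..remote3 are placeholders; --trc {.}.txt transfers each input file,
# returns the matching .txt and cleans up on the remote side:
parallel -S :,remote1,remote2,remote3 --trc {.}.txt --bar \
    'tesseract {} {.} > /dev/null 2>&1' ::: *.tif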
I posted this on superuser yesterday, but maybe it's better here. I apologize if I am incorrect.
I am trying to troubleshoot the below aria2c command run using Ubuntu 14.04. Basically the download starts, gets to about 30 minutes left, then fails with a timeout error. The file being downloaded is 25GB, and there are often multiple files downloaded using a loop. Any suggestions to make this more efficient and stable? Currently, each file takes about 4 hours to download, which is OK as long as there are no errors. I also get an aria2 control file along with the partially downloaded file. Thank you :).
aria2c -x 4 -l log.txt -c -d /home/cmccabe/Desktop/download --http-user "xxxxx" --http-passwd xxxx xxx://www.example.com/x/x/xxx/"file"
I apologize for the tag; as I am not able to create a new one, that was the closest.
You can write a loop to rerun aria2c until all files are downloaded (see this gist).
Basically, you can put all the links in a file (e.g. list.txt):
http://foo/file1
http://foo/file2
...
Then run loop_aria2.sh:
#!/bin/bash
aria2c -j5 -i list.txt -c --save-session out.txt
has_error=`wc -l < out.txt`
while [ $has_error -gt 0 ]
do
    echo "still has $has_error errors, rerun aria2 to download ..."
    aria2c -j5 -i list.txt -c --save-session out.txt
    has_error=`wc -l < out.txt`
    sleep 10
done
aria2c -j5 -i list.txt -c --save-session out.txt will download 5 files in parallel (-j5) and write the failed downloads into out.txt. If out.txt is not empty ($has_error -gt 0), the loop reruns the same command to continue downloading. The -c option of aria2c will skip completed files.
PS: another, simpler solution (without checking for errors) is to just run aria2 1000 times, if you don't mind :)
seq 1000 | parallel -j1 aria2c -i list.txt -c
I am trying to download a large number of files from a webpage (which contains only an image, so I can use a simple wget), but I want to speed it up using GNU Parallel. Can anyone please help me parallelize this for loop? Thanks.
for i in `seq 1 1000`
do
    wget -O "$i.jpg" www.somewebsite.com/webpage
done
You could do it like this:
seq 1 1000 | parallel wget www.somewebsite.com/webpage/{}.jpg
You can also use the -P option to specify the number of jobs you want to run concurrently.
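For example, a hedged sketch limiting the run to 4 simultaneous downloads (the URL pattern is taken from the command above):
# -P 4 caps the number of parallel wget processes at 4:
seq 1 1000 | parallel -P 4 wget -O {}.jpg www.somewebsite.com/webpage/{}.jpg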
You may also decide to use curl instead, like:
parallel -P 1000 curl -o {}.jpg www.somewebsite.com/webpage/{}.jpg ::: {1..1000}