for loop in GNU Parallel - bash

I am trying to download a large number of files from a webpage (which contains only a image, so I can use a simple wget), but want to speed it up using GNU Parallel. Can anyone please help me parallelize this for loop? Thanks.
for i in `seq 1 1000`
do
wget -O "$i.jpg" www.somewebsite.com/webpage
done

You could do it like this:
seq 1 1000 | parallel wget www.somewebsite.com/webpage/{}.jpg
You can also use the -P option to specify the number of jobs you want to run concurrently.
Also you may decide to use curl instead like:
parallel -P 1000 curl -o {}.jpg www.somewebsite.com/webpage/{}.jpg ::: {1..1000}

Related

Iterations of a bash script to run in parallel

I have a bash script that looks like below.
$TOOL is another script which runs 2 times with different inputs(VAR1 and VAR2).
#Iteration 1
${TOOL} -ip1 ${VAR1} -ip2 ${FINAL_PML}/$1$2.txt -p ${IP} -output_format ${MODE} -o ${FINAL_MODE_DIR1}
rename mods mode_c_ ${FINAL_MODE_DIR1}/*.xml
#Iteration 2
${TOOL} -ip1 ${VAR2} -ip2 ${FINAL_PML}/$1$2.txt -p ${IP} -output_format ${MODE} -o ${FINAL_MODE_DIR2}
rename mods mode_c_ ${FINAL_MODE_DIR2}/*.xml
Can I make these 2 iterations in parallel inside a bash script without submitting it in a queue?
If I read this right, what you want is to run them in background.
c.f. https://linuxize.com/post/how-to-run-linux-commands-in-background/
More importantly, if you are going to be writing scripts, PLEASE read the following closely:
https://www.gnu.org/software/bash/manual/html_node/index.html#SEC_Contents
https://mywiki.wooledge.org/BashFAQ/001

Splitting wireshark large size with pcap splitter with bash

I have large pcapng files, and I want to split them based on my desired wireshark filters. I want to split my files by the help of bash scripts and using pcapsplitter, but when I use a loop, it always gives me the same file.
I have written a small code.
#!/bin/bash
for i in {57201..57206}
do
mkdir destination/$i
done
tcp="tcp port "
for i in {57201..57206}
do
tcp="$tcp$i"
pcapsplitter -f file.pcapng -o destination/$i -m bpf-filter -p $tcp
done
the question is, can I use bash for my goal or not?
If yes, why it does not work?
Definitely, this is something Bash can do.
Regarding your script, the first thing I can think of is this line :
pcapsplitter -f file.pcapng -o destination/$i -m bpf-filter -p $tcp
where the value of $tcp is actually tcp port 57201 (and following numbers on the next rounds. However, without quotes, you're actually passing tcp only to the -p parameter.
It should work better after you've changed this line into :
pcapsplitter -f file.pcapng -o destination/$i -m bpf-filter -p "$tcp"
NB: as a general advice, it's usually safer to double-quote variables in Bash.
NB2 : you don't need those 2 for loops. Here is how I'd rewrite your script :
#!/bin/bash
for portNumber in {57201..57206}; do
destinationDirectory="destination/$portNumber"
mkdir "$destinationDirectory"
thePparameter="tcp port $portNumber"
pcapsplitter -f 'file.pcapng' -o "$destinationDirectory" -m bpf-filter -p "$thePparameter"
done

Submit SGE job array with random file names

I have a script that was kicking off ~200 jobs for each sub-analysis. I realized that a job array would probably be much better for this for several reasons. It seems simple enough but is not quite working for me. My input files are not numbered so I've following examples I've seen I do this first:
INFILE=`sed -n ${SGE_TASK_ID}p <pathto/listOfFiles.txt`
My qsub command takes in quite a few variables as it is both pulling and outputting to different directories. $res does not change, however $INFILE is what I am looping through.
qsub -q test.q -t 1-200 -V -sync y -wd ${res} -b y perl -I /master/lib/ myanalysis.pl -c ${res}/${INFILE}/configFile-${INFILE}.txt -o ${res}/${INFILE}/
Since this was not working, I was curious as to what exactly was being passed. So I did an echo on this and saw that it only seems to expand up to the first time $INFILE is used. So I get:
perl -I /master/lib/ myanalysis.pl -c mydirectory/fileABC/
instead of:
perl -I /master/lib/ myanalysis.pl -c mydirectory/fileABC/configFile-fileABC.txt -o mydirectory/fileABC/
Hoping for some clarity on this and welcome all suggestions. Thanks in advance!
UPDATE: It doesn't look like $SGE_TASK_ID is set on the cluster. I looked for any variable that could be used for an array ID and couldn't find anything. If I see anything else I will update again.
Assuming you are using a grid engine variant then SGE_TASK_ID should be set within the job. It looks like you are expecting it to be set to some useful variable before you use qsub. Submitting a script like this would do roughly what you appear to be trying to do:
#!/bin/bash
INFILE=$(sed -n ${SGE_TASK_ID}p <pathto/listOfFiles.txt)
exec perl -I /master/lib/ myanalysis.pl -c ${res}/${INFILE}/configFile-${INFILE}.txt -o ${res}/${INFILE}/
Then submit this script with
res=${res} qsub -q test.q -t 1-200 -V -sync y -wd ${res} myscript.sh
`

Tesseract OCR large number of files

I have around 135000 .TIF files (1.2KB to 1.4KB) sitting on my hard drive. I need to extract text out of those files. If I run tesseract as a cron job I am getting 500 to 600 per hour at the most. Can anyone suggest me strategies so I can get atleast 500 per minute?
UPDATE:
Below is my code after implementing on suggestions given by #Mark still I dont seem to go beyond 20 files per min.
#!/bin/bash
cd /mnt/ramdisk/input
function tess()
{
if [ -f /mnt/ramdisk/output/$2.txt ]
then
echo skipping $2
return
fi
tesseract --tessdata-dir /mnt/ramdisk/tessdata -l eng+kan $1 /mnt/ramdisk/output/$2 > /dev/null 2>&1
}
export -f tess
find . -name \*.tif -print0 | parallel -0 -j100 --progress tess {/} {/.}
You need GNU Parallel. Here I process 500 TIF files of 3kB each in 37s on an iMac. By way of comparison, the same processing takes 160s if done in a sequential for loop.
The basic command looks like this:
parallel --bar 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif
which will show a progress bar and use all available cores on your machine. Here it is in action:
If you want to see what it would do without actually doing anything, use parallel --dry-run.
As you have 135,000 files it will probably overflow your command line length - you can check with sysctl like this:
sysctl -a kern.argmax
kern.argmax: 262144
So you need to pump the filenames into GNU Parallel on its stdin and separate them with null characters so you don't get problems with spaces:
find . -iname \*.tif -print0 | parallel -0 --bar 'tesseract {} {.} > /dev/null 2>&1'
If you are dealing with very large numbers of files, you probably need to consider the possibility of being interrupted and restarted. You could either mv each TIF file after processing to a subdirectory called processed so that it won't get done again on restarting, or you could test for the existence of the corresponding txt file before processing any TIF like this:
#!/bin/bash
doit() {
if [ -f "${2}.txt" ]; then
echo Skipping $1...
return
fi
tesseract "$1" "$2" > /dev/null 2>&1
}
export -f doit
time parallel --bar doit {} {.} ::: *.tif
If you run that twice in a row, you will see it is near instantaneous the second time because all the processing was done the first time.
If you have millions of files, you could consider using multiple machines in parallel, so just make sure you have ssh logins to each of the machines on your network and then run across 4 machines, including the localhost like this:
parallel -S :,remote1,remote2,remote3 ...
where : is shorthand for the machine on which you are running.

Error while searching for a string using xargs

I am trying to grep for a string as below but running into error shown below,can anyone suggest how to fix it?
find . | xargs grep 'bin data doesn't exist for HY11' -sl
Error:-
args: unmatched single quote; by default quotes are special to xargs unless you use the -0 option
Your grep pattern contains a quotation mark!
Use double quotes round the pattern: "bin doesn't exist for HY11" rather than 'bin ... HY11'.
You also want to add -print0 to the find command, and -0 to xargs.
The better way is to do this all directly:
find . -type f -exec grep -H "bin doesn't exist for HY11" "{}" "+"
That doesn't even need xargs.
If you have GNU Parallel you can run:
find . | parallel -X -q grep "bin data doesn't exist for HY11" -sl
All new computers have multiple cores, but most programs are serial in nature and will therefore not use the multiple cores. However, many tasks are extremely parallelizeable:
Run the same program on many files
Run the same program for every line in a file
Run the same program for every block in a file
GNU Parallel is a general parallelizer and makes is easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straight forward way to parallelize is to run 8 jobs on each CPU:
GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:
Installation
A personal installation does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Resources