I am working in Bash. I have a series of nested for loops that iteratively look for the presence of three lists of 96 barcode sequences. My goal is to find each unique combination of barcodes; there are 96 x 96 x 96 (884,736) possible combinations.
for barcode1 in "${ROUND1_BARCODES[@]}"
do
    grep -B 1 -A 2 "$barcode1" "$FASTQ_R" > ROUND1_MATCH.fastq
    echo "barcode1.is.$barcode1" >> outputLOG
    if [ -s ROUND1_MATCH.fastq ]
    then
        # Now look for ROUND2 barcodes in the reads that matched the previous step
        for barcode2 in "${ROUND2_BARCODES[@]}"
        do
            grep -B 1 -A 2 "$barcode2" ROUND1_MATCH.fastq > ROUND2_MATCH.fastq
            if [ -s ROUND2_MATCH.fastq ]
            then
                # Now look for ROUND3 barcodes in the reads that matched the previous step
                for barcode3 in "${ROUND3_BARCODES[@]}"
                do
                    grep -B 1 -A 2 "$barcode3" ROUND2_MATCH.fastq | sed '/^--/d' > ROUND3_MATCH.fastq
                    # If matches are found, write them to an output .fastq file iteratively labelled with an ID number
                    if [ -s ROUND3_MATCH.fastq ]
                    then
                        mv ROUND3_MATCH.fastq results/result.$count.2.fastq
                    fi
                    count=$((count + 1))
                done
            fi
        done
    fi
done
This code works and I am able to successfully extract the sequences with each barcode combination. However, I think the speed of this can be improved for working through large files by parallelizing the loop structure. I know that I can use GNU Parallel to do this; however, I am struggling to nest the parallelizations.
# Parallelize nested loops
now=$(date +"%T")
echo "Beginning STEP1.2: PARALLEL Demultiplex using barcodes. Current time : $now" >> outputLOG
mkdir ROUND1_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} SRR6750041_2_smalltest.fastq > ROUND1_PARALLEL_HITS/{#}_ROUND1_MATCH.fastq' ::: "${ROUND1_BARCODES[@]}"
mkdir ROUND2_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} ROUND1_PARALLEL_HITS/*.fastq > ROUND2_PARALLEL_HITS/{#}_{/.}.fastq' ::: "${ROUND2_BARCODES[@]}"
mkdir ROUND3_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} ROUND2_PARALLEL_HITS/*.fastq > ROUND3_PARALLEL_HITS/{#}_{/.}.fastq' ::: "${ROUND3_BARCODES[@]}"
mkdir parallel_results
parallel -j 6 'mv {} parallel_results/result_{#}.fastq' ::: ROUND3_PARALLEL_HITS/*.fastq
How can I successfully recreate the nested structure of the for loops using parallel?
This parallelizes only the inner loop:
for barcode1 in "${ROUND1_BARCODES[@]}"
do
    grep -B 1 -A 2 "$barcode1" "$FASTQ_R" > ROUND1_MATCH.fastq
    echo "barcode1.is.$barcode1" >> outputLOG
    if [ -s ROUND1_MATCH.fastq ]
    then
        # Now look for ROUND2 barcodes in the reads that matched the previous step
        for barcode2 in "${ROUND2_BARCODES[@]}"
        do
            grep -B 1 -A 2 "$barcode2" ROUND1_MATCH.fastq > ROUND2_MATCH.fastq
            if [ -s ROUND2_MATCH.fastq ]
            then
                # Now look for ROUND3 barcodes in the reads that matched the previous step
                doit() {
                    grep -B 1 -A 2 "$1" ./ROUND2_MATCH.fastq | sed '/^--/d'
                }
                export -f doit
                parallel -j0 doit {} '>' results/$barcode1-$barcode2-{} ::: "${ROUND3_BARCODES[@]}"
                # TODO remove files with 0 length
            fi
        done
    fi
done
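If you want GNU Parallel to handle the nesting itself, you can pass all three barcode lists in one call; with multiple ::: arguments, parallel generates the full cross product. A minimal sketch (it assumes FASTQ_R is set and a results/ directory exists; note that every one of the 884,736 jobs re-greps the full input file, so the inner-loop version above may still be faster on large files):
doit3() {
    b1=$1 b2=$2 b3=$3
    out="results/$b1-$b2-$b3.fastq"
    # Chain the three greps, stripping the "--" group separators between stages
    grep -B 1 -A 2 "$b1" "$FASTQ_R" | sed '/^--$/d' |
        grep -B 1 -A 2 "$b2" | sed '/^--$/d' |
        grep -B 1 -A 2 "$b3" | sed '/^--$/d' > "$out"
    # Drop empty outputs so only real barcode combinations remain
    [ -s "$out" ] || rm -f "$out"
}
export -f doit3
export FASTQ_R
parallel doit3 {1} {2} {3} ::: "${ROUND1_BARCODES[@]}" ::: "${ROUND2_BARCODES[@]}" ::: "${ROUND3_BARCODES[@]}"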
Issue
I have a few PE fastq files from an infected host. First I map reads to the host, keep the reads that did not map, and convert that single bam to new paired-end fastq files. The second loop takes the new PE fastq files and maps them to the pathogen. The problem I'm facing is that the beginning of the second loop does not find the associated R2.fastq. All work is being done on my institution's Linux compute cluster.
The appropriate files are created at the end of the first loop, and the second loop is able to find the R1 files, but not the R2 files in the same directory. I have stared at this for a couple of days now, making changes in an attempt to figure out the naming issue.
Any help determining the issue with the second for loop would be greatly appreciated. Keep in mind this is my first post and my degree is in biology. Please be gentle.
Code
#PBS -S /bin/bash
#PBS -l partition=bigmem,nodes=1:ppn=16,walltime=1:00:00:00
#PBS -A ACF-UTK0011
#PBS -M wbrewer5@vols.utk.edu
#PBS -m abe
#PBS -e /lustre/haven/user/wbrewer5/pandora/lowcov/error/
#PBS -o /lustre/haven/user/wbrewer5/pandora/lowcov/log/
#PBS -N PandoraLowCovMapping1
#PBS -n
cd $PBS_O_WORKDIR
set -x
module load samtools
module load bwa
#create indexes for pea aphid and pandora genomes
#bwa index -p pea_aphid.fna pea_aphid.fna
#bwa index -p pandora_canu_pilon.fasta pandora_canu_pilon.fasta
#map read files to the aphid genome and keep reads that do not map
for r1 in `ls /lustre/haven/user/wbrewer5/pandora/lowcov/reads/*R1.fastq`
do
r2=`sed 's/R1.fastq/R2.fastq/' <(echo $r1)`
BASE1=$(basename $r1 | sed 's/_R1.fastq*//g')
echo "r1 $r1"
echo "r2 $r2"
echo "BASE1 $BASE1"
bwa mem -t 16 -v 3 \
pea_aphid.fna \
$r1 \
$r2 |
samtools view -@ 16 -u -f 12 -F 256 - |
samtools sort -@ 16 -n - |
samtools fastq - \
-1 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/${BASE1}_unmapped_R1.fastq \
-2 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/${BASE1}_unmapped_R2.fastq \
-0 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/${BASE1}_trash.txt \
-s /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/${BASE1}_more_trash.txt
echo "Step 1: mapped reads from $BASE1 to aphid genome and saved to 1_samtools as paired end .fastq"
done
rm /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/*trash*
echo "saving unmapped reads to new fastq files complete!"
for f1 in `ls /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/*unmapped_R1.fastq`
do
f2=`sed 's/R1.fastq/R2.fastq/' >(echo $f1)`
BASE2=$(basename $f1 | sed 's/_R1.fastq*//g')
echo "f1 $f1"
echo "f2 $f2"
echo "BASE2 $BASE2"
bwa mem -t 16 -v 3 \
pandora_canu_pilon.fasta \
$f1 \
$f2 |
samtools sort -@ 16 -o ./2_angsd/${BASE2}.bam -
echo "Step 2: mapped reads from $BASE2 to pandora genome saved to 2_angsd as .bam"
done
echo "Mapping new fastq files to pandora genome complete!!"
Log
First file of first loop
++ ls /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_614_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_686_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/p-251_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/p-614_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/p-686_R1.fastq
+ for r1 in '`ls /lustre/haven/user/wbrewer5/pandora/lowcov/reads/*R1.fastq`'
++ sed s/R1.fastq/R2.fastq/ /dev/fd/63
+++ echo /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq
+ r2=/lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R2.fastq
++ basename /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq
++ sed 's/_R1.fastq*//g'
+ BASE1=Matt_251
+ echo 'r1 /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq'
+ echo 'r2 /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R2.fastq'
+ echo 'BASE1 Matt_251'
+ bwa mem -t 16 -v 3 pea_aphid.fna /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R2.fastq
+ samtools view -@ 16 -u -f 12 -F 256 -
+ samtools sort -@ 16 -n -
+ samtools fastq - -1 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_unmapped_R1.fastq -2 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_unmapped_R2.fastq -0 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_trash.txt -s /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_more_trash.txt
First file of second loop
++ ls /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_614_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_686_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/p-251_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/p-614_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/p-686_unmapped_R1.fastq
+ for f1 in '`ls /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/*unmapped_R1.fastq`'
++ sed s/R1.fastq/R2.fastq/ /dev/fd/63
+++ echo /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_unmapped_R1.fastq
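For what it's worth, comparing the two loops points at the f2 assignment: the first loop feeds sed an input process substitution, <(echo $r1), while the second uses an output one, >(echo $f1), so sed gets nothing to read and f2 ends up empty. A minimal sketch of a fix using plain parameter expansion, with no sed or subshell needed (an assumption based on the trace above, not a confirmed fix):
# Hypothetical fix for the second loop: derive f2 from f1 in bash itself.
f2=${f1/_R1.fastq/_R2.fastq}                # bash pattern substitution
BASE2=$(basename "$f1" _unmapped_R1.fastq)  # strip the known suffix directly
echo "f1 $f1"
echo "f2 $f2"
echo "BASE2 $BASE2"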
I have the following line of code:
for h in "${Hosts[@]}" ; do echo "$MyLog" | grep -m 1 -B 3 -A 1 "$h" >> /LogOutput ; done
My hosts variable is a large array of hosts.
Is there a better way to do this that doesn't require me to echo on each loop? Like grep on a variable instead?
No echo, no loop
#!/bin/bash
hosts=(host1 host2 host3)
MyLog="
asf host
sdflkj
sadkjf
sdlkjds
lkasf
sfal
asf host2
sdflkj
sadkjf
"
re="${hosts[#]}"
egrep -m 1 -B 3 -A 1 ${re// /|} <<< "$MyLog"
Variant with one echo
echo "$MyLog" | egrep -m 1 -B 3 -A 1 ${re// /|}
Usage
$ ./test
sdlkjds
lkasf
sfal
asf host2
sdflkj
One echo, no loops, and all grepping done in parallel, with GNU Parallel:
echo "$MyLog" | parallel -k --tee --pipe 'grep -m 1 -B 3 -A 1 {}' ::: "${hosts[#]}"
The -k keeps the output in order.
The --tee and the --pipe ensure that the stdin is duplicated to all processes.
The processes that are run in parallel are enclosed in single quotes.
printf your string to multiple lines that you can then grep? Something like:
printf '%s\n' "${Hosts[@]}" | grep -m 1 -B 3 -A 1 "$h" >> /LogOutput
Assuming you're on a GNU system. Otherwise, see info grep.
From grep --help
grep --help | head -n1
Output
Usage: grep [OPTION]... PATTERN [FILE]...
So, according to that, you can do:
for h in "${Hosts[@]}" ; do grep -m 1 -B 3 -A 1 "$h" "$MyLog" >> /LogOutput ; done
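If Hosts is very large, another option (a sketch, with a caveat) is a single grep run that takes all hosts as patterns via -f and a process substitution; note that -m 1 then stops after the first matching line in the whole file, not one match per host:
grep -B 3 -A 1 -f <(printf '%s\n' "${Hosts[@]}") "$MyLog" >> /LogOutput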
I have n folders in destdir. Each folder contains two files: *R1.fastq and *R2.fastq. This script runs the job (bowtie2) on each folder one by one and outputs {name of the subfolder}.sam in destdir.
#!/bin/bash
mm9_index="/Users/bowtie2-2.2.6/indexes/mm9/mm9"
destdir=/Users/Desktop/test/outdir/
for f in $destdir/*
do
fbase=$(basename "$f")
echo "Sample $fbase"
bowtie2 -p 4 -x $mm9_index -X 2000 \
-1 "$f"/*R1.fastq \
-2 "$f"/*R2.fastq \
-S $destdir/${fbase}.sam
done
I want to use the GNU Parallel tool to speed this up. Can you help? Thanks.
Use a bash function:
#!/bin/bash
my_bowtie() {
mm9_index="/Users/bowtie2-2.2.6/indexes/mm9/mm9"
destdir=/Users/Desktop/test/outdir/
f="$1"
fbase=$(basename "$f")
echo "Sample $fbase"
bowtie2 -p 4 -x $mm9_index -X 2000 \
-1 "$f"/*R1.fastq \
-2 "$f"/*R2.fastq \
-S $destdir/${fbase}.sam
}
export -f my_bowtie
destdir=/Users/Desktop/test/outdir/
parallel my_bowtie ::: "$destdir"/*
For more details: man parallel or http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Calling-Bash-functions
At its simplest, you can normally just put echo on the front of your commands and send the list of commands that you would have executed sequentially to GNU Parallel, to execute in parallel, like this:
for f in ...; do
echo bowtie2 -p 4 ....
done | parallel
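Applied to the bowtie2 loop above, that might look like this (a sketch reusing the paths from the question):
mm9_index="/Users/bowtie2-2.2.6/indexes/mm9/mm9"
destdir=/Users/Desktop/test/outdir/
for f in "$destdir"/*; do
    fbase=$(basename "$f")
    # Emit the command line instead of running it; parallel executes each line
    echo "bowtie2 -p 4 -x $mm9_index -X 2000 -1 $f/*R1.fastq -2 $f/*R2.fastq -S $destdir/$fbase.sam"
done | parallel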
I have a small bash script to OCR PDF files (slightly modified from this script). The basic flow for each file is:
For each page in the PDF file:
Convert the page to a TIFF image (ImageMagick)
OCR the image (tesseract)
Cat the results to a text file
Script:
FILES=/home/tgr/OCR/input/*.pdf
for f in $FILES
do
FILENAME=$(basename "$f")
ENDPAGE=$(pdfinfo $f | grep "^Pages: *[0-9]\+$" | sed 's/.* //')
OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
RESOLUTION=1400
touch $OUTPUT
for i in `seq 1 $ENDPAGE`; do
convert -monochrome -density $RESOLUTION $f\[$(($i - 1 ))\] page.tif
echo processing file $f, page $i
tesseract page.tif tempoutput -l ces
cat tempoutput.txt >> $OUTPUT
done
rm tempoutput.txt
rm page.tif
done
Because of the high resolution and the fact that tesseract can utilize only one core, the process is extremely slow (it takes approx. 3 minutes to convert one PDF file).
Because I have thousands of PDF files, I think I can use parallel to use all 4 cores, but I don't get the concept of how to use it. In the examples I see:
Nested for-loops like this:
(for x in `cat xlist` ; do
for y in `cat ylist` ; do
do_something $x $y
done
done) | process_output
can be written like this:
parallel do_something {1} {2} :::: xlist ylist | process_output
Unfortunately, I was not able to figure out how to apply this. How do I parallelize my script?
Since you have 1000s of PDF files, it is probably enough simply to parallelize the processing of the PDF files rather than parallelizing the processing of the pages in a single file.
function convert_func {
f=$1
FILENAME=$(basename "$f")
ENDPAGE=$(pdfinfo $f | grep "^Pages: *[0-9]\+$" | sed 's/.* //')
OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
RESOLUTION=1400
touch $OUTPUT
for i in `seq 1 $ENDPAGE`; do
convert -monochrome -density $RESOLUTION $f\[$(($i - 1 ))\] $$.tif
echo processing file $f, page $i
tesseract $$.tif $$ -l ces
cat $$.txt >> $OUTPUT
done
rm $$.txt
rm $$.tif
}
export -f convert_func
parallel convert_func ::: /home/tgr/OCR/input/*.pdf
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (man parallel_tutorial or http://www.gnu.org/software/parallel/parallel_tutorial.html). Your command line will love you for it.
Read the EXAMPLEs (LESS=+/EXAMPLE: man parallel).
You can have a script like this.
#!/bin/bash
function convert_func {
local FILE=$1 RESOLUTION=$2 PAGE_INDEX=$3 OUTPUT=$4
local TEMP0=$(exec mktemp --suffix ".00.$PAGE_INDEX.tif")
local TEMP1=$(exec mktemp --suffix ".01.$PAGE_INDEX")
echo convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP0" ## Just for debugging purposes.
convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP0"
echo "processing file $FILE, page $PAGE_INDEX" ## I think you mean to place this before the line above.
tesseract "$TEMP0" "$TEMP1" -l ces
cat "$TEMP1".txt >> "$OUTPUT" ## Lines may be mixed up from different processes here and a workaround may still be needed but it may no longer be necessary if outputs are small enough.
rm -f "$TEMP0" "$TEMP1"
}
export -f convert_func
FILES=(/home/tgr/OCR/input/*.pdf)
for F in "${FILES[@]}"; do
FILENAME=${F##*/}
ENDPAGE=$(exec pdfinfo "$F" | grep '^Pages: *[0-9]\+$' | sed 's/.* //')
OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
RESOLUTION=1400
touch "$OUTPUT" ## This may no longer be necessary. Or probably you mean to truncate it instead e.g. : > "$OUTPUT"
for (( I = 1; I <= ENDPAGE; ++I )); do
printf "%s\xFF%s\xFF%s\xFF%s\x00" "$F" "$RESOLUTION" "$I" "$OUTPUT"
done | parallel -0 -C $'\xFF' -j 4 -- convert_func '{1}' '{2}' '{3}' '{4}'
done
It exports a function that parallel can call, properly sanitizes the arguments, and uses unique temporary files to make parallel processing possible.
Update: this version holds output in multiple temporary files first, before concatenating them into the one main output file.
#!/bin/bash
shopt -s nullglob
function convert_func {
local FILE=$1 RESOLUTION=$2 PAGE_INDEX=$3 OUTPUT=$4 TEMPLISTFILE=$5
local TEMP_TIF=$(exec mktemp --suffix ".01.$PAGE_INDEX.tif")
local TEMP_TXT_BASE=$(exec mktemp --suffix ".02.$PAGE_INDEX")
echo "processing file $FILE, page $PAGE_INDEX"
echo convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP_TIF" ## Just for debugging purposes.
convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP_TIF"
tesseract "$TEMP_TIF" "$TEMP_TXT_BASE" -l ces
echo "$PAGE_INDEX"$'\t'"${TEMP_TXT_BASE}.txt" >> "$TEMPLISTFILE"
rm -f "$TEMP_TIF"
}
export -f convert_func
FILES=(/home/tgr/OCR/input/*.pdf)
for F in "${FILES[@]}"; do
FILENAME=${F##*/}
ENDPAGE=$(exec pdfinfo "$F" | grep '^Pages: *[0-9]\+$' | sed 's/.* //')
BASENAME=${FILENAME%.*}
OUTPUT="/home/tgr/OCR/output/$BASENAME.txt"
RESOLUTION=1400
TEMPLISTFILE=$(exec mktemp --suffix ".00.$BASENAME")
: > "$TEMPLISTFILE"
for (( I = 1; I <= ENDPAGE; ++I )); do
printf "%s\xFF%s\xFF%s\xFF%s\x00" "$F" "$RESOLUTION" "$I" "$OUTPUT"
done | parallel -0 -C $'\xFF' -j 4 -- convert_func '{1}' '{2}' '{3}' '{4}' "$TEMPLISTFILE"
while IFS=$'\t' read -r __ FILE; do
cat "$FILE"
rm -f "$FILE"
done < <(exec sort -n "$TEMPLISTFILE") > "$OUTPUT"
rm -f "$TEMPLISTFILE"
done
It's about
http://en.wikipedia.org/wiki/Parallel_(software)
and its very rich manpage http://www.gnu.org/software/parallel/man.html
(for x in `cat list` ; do
do_something $x
done) | process_output
is replaced by this
cat list | parallel do_something | process_output
I am trying to implement that on this:
while [ "$n" -gt 0 ]
do
percentage=${"scale=2;(100-(($n / $end) * 100))"|bc -l}}
#get url from line specified by n from file done1
nextUrls=`sed -n "${n}p" < done1`
echo -ne "${percentage}% $n / $end urls saved going to line 1. current: $nextUrls\r"
# function that gets links from the url
getlinks $nextUrls
#save n
echo $n > currentLine
let "n--"
let "end=`cat done1 |wc -l`"
done
While reading the documentation for GNU Parallel I found out that functions are not supported, so getlinks won't be usable in parallel. The best I have found so far is
seq 30 | parallel -n 4 --colsep ' ' echo {1} {2} {3} {4}
makes output
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
17 18 19 20
21 22 23 24
25 26 27 28
29 30
The while loop mentioned above should go like this, if I am right:
end=`cat done1 | wc -l`
seq $end -1 1 | parallel -j+4 -k
# (everything except the getlinks function goes here, but I don't know how?) |
# every time it finishes, do
getlinks $nextUrls
Thanks for help in advance.
It seems what you want is a progress meter. Try:
cat done1 | parallel --eta wget
If that is not what you want, look at sem (sem is an alias for parallel --semaphore and is normally installed with GNU Parallel):
for i in `ls *.log` ; do
echo $i
sem -j+0 gzip $i ";" echo done
done
sem --wait
In your case it will be something like:
while [ "$n" -gt 0 ]
do
percentage=${"scale=2;(100-(($n / $end) * 100))"|bc -l}}
#get url from line specified by n from file done1
nextUrls=`sed -n "${n}p" < done1`
echo -ne "${percentage}% $n / $end urls saved going to line 1. current: $nextUrls\r"
# function that gets links from the url
THE_URL=`getlinks $nextUrls`
sem -j10 wget $THE_URL
#save n
echo $n > currentLine
let "n--"
let "end=`cat done1 |wc -l`"
done
sem --wait
echo All done
Why does getlinks need to be a function? Take the function and turn it into a shell script (it should be essentially identical, except you need to export environment variables in, and of course you cannot affect the outside environment without lots of work).
Of course, you cannot save $n into currentLine when you are executing in parallel. All the jobs will be overwriting the file at the same time.
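One way around that clobbering (a sketch, not from the answer above; getlinks.sh is hypothetical and stands for the getlinks function rewritten as a standalone script, as suggested) is to give every job its own output file using GNU Parallel's {#} job-number replacement string and merge afterwards:
end=`cat done1 | wc -l`
seq $end -1 1 | parallel './getlinks.sh {} > out.{#}'   # {#} = job number, one file per job
cat out.* > merged    # combine only after all jobs have finished
rm -f out.*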
I was thinking of making something more like this, if not parallel or sem then something else, because parallel does not support functions, aka http://www.gnu.org/software/parallel/man.html#aliases_and_functions_do_not_work
getlinks(){
if [ -n "$1" ]
then
lynx -image_links -dump "$1" > src
grep -i ".jpg" < src > links1
grep -i "http" < links1 >links
sed -e 's/.*\(http\)/http/g' < links >> done1
sort -f done1 > done2
uniq done2 > done1
rm -rf links1 links src done2
fi
}
func(){
    percentage=$(echo "scale=2;(100-(($1 / $end) * 100))" | bc -l)
    # get url from line specified by $1 from file done1
    nextUrls=`sed -n "${1}p" < done1`
    echo -ne "${percentage}% $1 / $end urls saved going to line 1. current: $nextUrls\r"
    # function that gets links from the url
    getlinks $nextUrls
    # save n
    echo $1 > currentLine
    let "end=`cat done1 | wc -l`"
}
while [ "$n" -gt 0 ]
do
sem -j10 func $n
done
sem --wait
echo All done
My script has become really complex, and I do not want to make a feature unavailable with something I am not sure can be done. This way I can get links with the full internet traffic being used, so it should take less time.
Tried sem:
#!/bin/bash
func (){
echo 1
echo 2
}
for i in `seq 10`
do
sem -j10 func
done
sem --wait
echo All done
you get these errors:
Can't exec "func": No such file or directory at /usr/share/perl/5.10/IPC/Open3.pm line 168.
open3: exec of func failed at /usr/local/bin/sem line 3168
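That error is sem failing to find func in the shell it spawns. On versions of GNU Parallel that support exported bash functions, adding export -f appears to fix it (a sketch, following the same pattern as the earlier answers):
#!/bin/bash
func() {
    echo 1
    echo 2
}
export -f func   # make func visible to the shells that sem spawns
for i in `seq 10`
do
    sem -j10 func
done
sem --wait
echo All done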
It is not quite clear what the end goal of your script is. If you are trying to write a parallel web crawler, you might be able to use the below as a template.
#!/bin/bash
# E.g. http://gatt.org.yeslab.org/
URL=$1
# Stay inside the start dir
BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
URLLIST=$(mktemp urllist.XXXX)
URLLIST2=$(mktemp urllist.XXXX)
SEEN=$(mktemp seen.XXXX)
# Spider to get the URLs
echo $URL >$URLLIST
cp $URLLIST $SEEN
while [ -s $URLLIST ] ; do
cat $URLLIST |
parallel lynx -listonly -image_links -dump {} \; wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and do { $seen{$1}++ or print }' |
grep -F $BASEURL |
grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
mv $URLLIST2 $URLLIST
done
rm -f $URLLIST $URLLIST2 $SEEN