Passing multiple arguments to parallel function when uploading to FTP - bash

I'm using ncftpput to upload images to an FTP server.
An example of the script is
# destination, origin
ncftpput -R ftp_server icon_d2/cape_cin ./cape_cin_*.png
ncftpput -R ftp_server icon_d2/t_v_pres ./t_v_pres_*.png
ncftpput -R ftp_server icon_d2/it/cape_cin ./it/cape_cin_*.png
ncftpput -R ftp_server icon_d2/it/t_v_pres ./it/t_v_pres_*.png
I'm trying to parallelize this with GNU parallel, but I'm struggling to pass the arguments to ncftpput. I know I'm doing something wrong but somehow cannot find the solution.
If I construct the array of what I need to upload
images_output=("cape_cin" "t_v_pres")
# suffix for naming
projections_output=("" "it/")
# remote folder on server
projections_output_folder=("icon_d2" "icon_d2/it")
# Create a list of all the images to upload
upload_elements=()
for i in "${!projections_output[#]}"; do
for j in "${images_output[#]}"; do
upload_elements+=("${projections_output_folder[$i]}/${j} ./${projections_output[$i]}${j}_*.png")
done
done
Then I can do the upload in serial like this
for k in "${upload_elements[#]}"; do
ncftpput -R ftp_server ${k}
done
When using parallel I'm using colsep to separate the arguments
parallel -j 5 --colsep ' ' ncftpput -R ftp_server ::: "${upload_elements[@]}"
but ncftpput gives an error that tells me it is not understanding the structure of the passed argument.
What am I doing wrong?

Try:
parallel -j 5 --colsep ' ' eval ncftpput -R ftp_server ::: "${upload_elements[@]}"
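Most likely the problem is quoting: with --colsep, GNU parallel quotes each column it inserts, so the glob ./cape_cin_*.png reaches ncftpput as a literal string, and no file with that literal name exists. The eval runs the line through the shell once more, which expands the glob. To see what would actually be executed before uploading anything, --dry-run prints the jobs instead of running them:
parallel --dry-run -j 5 --colsep ' ' eval ncftpput -R ftp_server ::: "${upload_elements[@]}"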
This should do exactly the same:
for k in "${upload_elements[#]}"; do
echo ncftpput -R ftp_server ${k}
done | parallel -j 5

Related

Using bash how to iterate through lines in a txt file and sequentially pair up every two lines

Hi, I am attempting to use bash to iterate through a .txt file which contains the following lines. This is a smaller subset of the full list of fastq files, but all samples follow the same pattern.
/path/10-xxx-111-sample_S1_R1.fastq.gz
/path/10-xxx-111-sample_S1_R2.fastq.gz
/path/12-xxx-222-sample_S2_R1.fastq.gz
/path/12-xxx-222-sample_S2_R2.fastq.gz
/path/13-xxx-333-sample_S3_R1.fastq.gz
/path/13-xxx-333-sample_S3_R2.fastq.gz
And the aim is to pair every two lines and use the paths to provide information to further code in bash.
bwa mem ${index} ${r1} ${r2} -M -t 8 \
-R "#RG\tID:FlowCell.${name}\tSM:${name}\tPL:illumina\tLB:${Job}.${name}" | \
samtools sort -O bam -o ${bam}/${name}_bwa_output.bam
The first R1 and R2 should correspond to ${r1} and ${r2} respectively, in sequential order.
The ${name}'s are contained in another file and consist of "10-xxx-111-sample_S1_" type information.
Any help in iterating through this text file to inform the downstream code would be really appreciated.
Intended output: First two lines of the .txt file will inform downstream code. e.g.
bwa mem ${index} /path/10-xxx-111-sample_S1_R1.fastq.gz /path/10-xxx-111-sample_S1_R2.fastq.gz -M -t 8 \
Following this, the next two lines will inform the downstream code and so forth. e.g.
bwa mem ${index} /path/12-xxx-222-sample_S2_R1.fastq.gz /path/12-xxx-222-sample_S2_R2.fastq.gz -M -t 8 \
Why not read 2 lines at a time? Remove the echo before bwa... when you're satisfied with the result.
$ cat myScript.sh
#!/usr/bin/env bash
while IFS= read -r r1 && IFS= read -r r2; do
echo bwa mem "${index}" "$r1" "$r2" -M -t 8 ...
done < fastq_filepaths.txt
$ index=myIndex ./myScript.sh
bwa mem myIndex /path/10-xxx-111-sample_S1_R1.fastq.gz /path/10-xxx-111-sample_S1_R2.fastq.gz -M -t 8 ...
bwa mem myIndex /path/12-xxx-222-sample_S2_R1.fastq.gz /path/12-xxx-222-sample_S2_R2.fastq.gz -M -t 8 ...
bwa mem myIndex /path/13-xxx-333-sample_S3_R1.fastq.gz /path/13-xxx-333-sample_S3_R2.fastq.gz -M -t 8 ...
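In the spirit of the other questions here, the same pairing can also be handed to GNU parallel, which can take two lines per job with -N2 (a sketch, not from the original answer; keep the echo until the output looks right):
parallel -N2 echo bwa mem "$index" {1} {2} -M -t 8 ... :::: fastq_filepaths.txt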

Issue with downloading multiple files with names in BASH

I'm trying to download multiple files in parallel using xargs. Things work well if I just download each file without giving it a name: echo ${links[@]} | xargs -P 8 -n 1 wget. Is there any way that allows me to download with a filename, like wget -O [filename] [URL], but in parallel?
Below is my work. Thank you.
links=(
"https://apod.nasa.gov/apod/image/1901/sombrero_spitzer_3000.jpg"
"https://apod.nasa.gov/apod/image/1901/orionred_WISEantonucci_1824.jpg"
"https://apod.nasa.gov/apod/image/1901/20190102UltimaThule-pr.png"
"https://apod.nasa.gov/apod/image/1901/UT-blink_3d_a.gif"
"https://apod.nasa.gov/apod/image/1901/Jan3yutu2CNSA.jpg"
)
names=(
"file1.jpg"
"file2.jpg"
"file3.jpg"
"file4.jpg"
"file5.jpg"
)
echo ${links[@]} ${names[@]} | xargs -P 8 -n 1 wget
With GNU Parallel you can do:
parallel wget -O {2} {1} ::: "${links[@]}" :::+ "${names[@]}"
If a download fails, GNU Parallel can also retry commands with --retries 3.
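If you would rather stick with xargs, a sketch in the same spirit (assuming no name or URL contains whitespace): paste pairs the two arrays line by line, and -n 2 hands wget its two arguments per call:
paste -d ' ' <(printf '%s\n' "${names[@]}") <(printf '%s\n' "${links[@]}") | xargs -P 8 -n 2 wget -O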

Check if a remote file exists in bash

I am downloading files with this script:
parallel --progress -j16 -a ./temp/img-url.txt 'wget -nc -q -P ./images/ {}; wget -nc -q -P ./images/ {.}_{001..005}.jpg'
Would it be possible not to download the files, but just check them on the remote side and, if they exist, create a dummy file instead of downloading?
Something like:
if wget --spider $url 2>/dev/null; then
#touch img.file
fi
should work, but I don't know how to combine this code with GNU Parallel.
Edit:
Based on Ole's answer I wrote this piece of code:
#!/bin/bash
do_url() {
url="$1"
wget -q -nc --method HEAD "$url" && touch ./images/${url##*/}
#get filename from $url
url2=${url##*/}
wget -q -nc --method HEAD ${url%.jpg}_{001..005}.jpg && touch ./images/${url2%.jpg}_{001..005}.jpg
}
export -f do_url
parallel --progress -a urls.txt do_url {}
It works, but it fails for some files. I cannot find any consistency in why it works for some files and fails for others. Maybe it has something to do with the last filename. The second wget tries to access the correct URL, but the touch command after it simply does not create the desired file. The first wget always (correctly) downloads the main image without the _001.jpg, _002.jpg suffix.
Example urls.txt:
http://host.com/092401.jpg (works correctly, _001.jpg.._005.jpg are downloaded)
http://host.com/HT11019.jpg (does not work, only the main image is downloaded)
It is pretty hard to understand what it is you really want to accomplish. Let me try to rephrase your question.
I have urls.txt containing:
http://example.com/dira/foo.jpg
http://example.com/dira/bar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.org/dira/foo.jpg
On example.com these URLs exist:
http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_005.jpg
http://example.com/dira/bar_000.jpg
http://example.com/dira/bar_002.jpg
http://example.com/dira/bar_004.jpg
http://example.com/dira/fubar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.com/dirb/baz_001.jpg
http://example.com/dirb/baz_005.jpg
On example.org these URLs exist:
http://example.org/dira/foo_001.jpg
Given urls.txt I want to generate the combinations with _001.jpg .. _005.jpg in addition to the original URL. E.g.:
http://example.com/dira/foo.jpg
becomes:
http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_002.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_004.jpg
http://example.com/dira/foo_005.jpg
Then I want to test if these URLs exist without downloading the file. As there are many URLs I want to do this in parallel.
If the URL exists I want an empty file created.
(Version 1): I want the empty file created in a similar directory structure in the dir images. This is needed because some of the images have the same name, but in different dirs.
So the files created should be:
images/http:/example.com/dira/foo.jpg
images/http:/example.com/dira/foo_001.jpg
images/http:/example.com/dira/foo_003.jpg
images/http:/example.com/dira/foo_005.jpg
images/http:/example.com/dira/bar_000.jpg
images/http:/example.com/dira/bar_002.jpg
images/http:/example.com/dira/bar_004.jpg
images/http:/example.com/dirb/foo.jpg
images/http:/example.com/dirb/baz.jpg
images/http:/example.com/dirb/baz_001.jpg
images/http:/example.com/dirb/baz_005.jpg
images/http:/example.org/dira/foo_001.jpg
(Version 2): I want the empty file created in the dir images. This can be done because all the images have unique names.
So the files created should be:
images/foo.jpg
images/foo_001.jpg
images/foo_003.jpg
images/foo_005.jpg
images/bar_000.jpg
images/bar_002.jpg
images/bar_004.jpg
images/baz.jpg
images/baz_001.jpg
images/baz_005.jpg
(Version 3): I want the empty file created in the dir images, named as in urls.txt. This can be done because only one of _001.jpg .. _005.jpg exists.
images/foo.jpg
images/bar.jpg
images/baz.jpg
#!/bin/bash
do_url() {
url="$1"
# Version 1:
# If you want to keep the folder structure from the server (similar to wget -m):
wget -q --method HEAD "$url" && mkdir -p images/"$2" && touch images/"$url"
# Version 2:
# If all the images have unique names and you want all images in a single dir
wget -q --method HEAD "$url" && touch images/"$3"
# Version 3:
# If all the images have unique names when _###.jpg is removed and you want all images in a single dir
wget -q --method HEAD "$url" && touch images/"$4"
}
export -f do_url
parallel do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
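For reference, the replacement strings here are GNU parallel's standard ones: {1.} is the first input with its extension removed, {1//} is its dirname, {1/.} is its basename without the extension, and {1/} is its basename. Adding --dry-run shows exactly what each URL expands to without running anything:
parallel --dry-run do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg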
GNU Parallel takes a few ms per job. When your jobs are this short, the overhead will affect the timing. If none of your CPU cores are running at 100% you can run more jobs in parallel:
parallel -j0 do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
You can also "unroll" the loop. This will save 5 overheads per URL:
do_url() {
url="$1"
# Version 2:
# If all the images have unique names and you want all images in a single dir
wget -q --method HEAD "$url".jpg && touch images/"$url".jpg
wget -q --method HEAD "$url"_001.jpg && touch images/"$url"_001.jpg
wget -q --method HEAD "$url"_002.jpg && touch images/"$url"_002.jpg
wget -q --method HEAD "$url"_003.jpg && touch images/"$url"_003.jpg
wget -q --method HEAD "$url"_004.jpg && touch images/"$url"_004.jpg
wget -q --method HEAD "$url"_005.jpg && touch images/"$url"_005.jpg
}
export -f do_url
parallel -j0 do_url {.} :::: urls.txt
Finally you can run more than 250 jobs: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround
You may use curl instead to check if the URLs you are parsing are there without downloading any file as such:
if curl --head --fail --silent "$url" >/dev/null; then
touch .images/"${url##*/}"
fi
Explanation:
--fail will make the exit status nonzero on a failed request.
--head will avoid downloading the file contents.
--silent will avoid status or errors from being emitted by the check itself.
To solve the "looping" issue, you can do:
urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[#]}"; do
if curl --head --silent --fail "$url" > /dev/null; then
touch .images/${url##*/}
fi
done
From what I can see, your question isn't really about how to use wget to test for the existence of a file, but rather on how to perform correct looping in a shell script.
Here is a simple solution for that:
urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[#]}"; do
if wget -q --method=HEAD "$url"; then
touch .images/${url##*/}
fi
done
This invokes Wget with the --method=HEAD option. With a HEAD request, the server simply reports back whether the file exists or not, without returning any data.
Of course, with a large data set this is pretty inefficient. You're creating a new connection to the server for every file you try. Instead, as suggested in the other answer, you could use GNU Wget2. With wget2, you can test all of these in parallel, and use the new --stats-site option to get a list of all the files and the specific return code that the server provided. For example:
$ wget2 --spider --progress=none -q --stats-site example.com/{,1,2,3}
Site Statistics:
http://example.com:
Status No. of docs
404 3
http://example.com/3 0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
http://example.com/1 0 bytes (gzip) : 0 bytes (decompressed), 241ms (transfer) : 241ms (response)
http://example.com/2 0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
200 1
http://example.com/ 0 bytes (identity) : 0 bytes (decompressed), 231ms (transfer) : 231ms (response)
You can even get this data printed as CSV or JSON for easier parsing.
Just loop over the names?
for uname in "${url%.jpg}"_{001..005}.jpg
do
if wget --spider "$uname" 2>/dev/null; then
touch ./images/"${uname##*/}"
fi
done
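To run this check in parallel, as asked, one sketch (the function name do_check is just for illustration) reuses the export -f pattern shown earlier:
do_check() {
url="$1"
for uname in "${url%.jpg}"_{001..005}.jpg
do
if wget -q --spider "$uname"; then
touch ./images/"${uname##*/}"
fi
done
}
export -f do_check
parallel --progress -a urls.txt do_check {}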
You could send a command via ssh to see if the remote file exists and cat it if it does:
ssh your_host 'test -e "somefile" && cat "somefile"' > somefile
You could also try scp, which supports glob expressions and recursion.
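For example (a sketch; host and paths are placeholders, and the quotes make the remote shell expand the glob):
scp 'your_host:/remote/path/*.jpg' ./images/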

Insert string after match in variable

I am trying to make a workaround to solve a problem.
We have a GTK+ program that calls a bash script, which calls rdesktop.
On one machine, we discovered that the rdesktop call needs one extra parameter...
Since I didn't write any of this code, and I cannot modify the GTK part of the problem, I can only edit the bash script that makes the call in the middle.
I have a variable called CMD with something that looks like:
rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:HP_Officejet_Pro_8600 -a 16 -u -p -d -g 80% 192.168.0.5
i need to "live edit" this line for when the printer parameter exists, it append ="MS Publisher Imagesetter" after the printer name.
The best I have accomplished so far is
ladb@luisdesk ~ $ input="rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:HP_Officejet_Pro_8600 -a 16 -u -p -d -g 80% 192.168.0.5"
ladb@luisdesk ~ $ echo $input | sed s/'printer:.*a /=\"MS Publisher Imagesetter\" '/
Which returns:
rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r ="MS Publisher Imagesetter" 16 -u -p -d -g 80% 192.168.0.5
Almost there, but I need to append the string, not replace it.
help?
Edit: I pasted incomplete examples; fixed.
Edit2:
With the help of those who responded, I ended up with
echo "$input" | sed 's/\(printer:\)\([^ ]*\)/\1\2="MS Publisher Imagesetter"/'
If you want the output to look like:
rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:"HP_Officejet_Pro_8600 MS Publisher Imagesetter" -a 16 -u -p -d -g 80% 192.168.0.5
This sed will do; it matches the printer: part first, then the existing printer name, and quotes both. If that is not quite what you want, you can adjust the replacement
groups to put the quotes/spacing where you want:
input="rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:HP_Officejet_Pro_8600 -a 16 -u -p -d -g 80% 192.168.0.5"
echo "$input" | sed 's/\(printer:\)\([^ ]*\)/\1"\2 MS Publisher Imagesetter"/'
output:
rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:"HP_Officejet_Pro_8600 MS Publisher Imagesetter" -a 16 -u -p -d -g 80% 192.168.0.5
You can use this:
sed 's/printer:[^ ]\+/&="MS Publisher Imagesetter"/' <<< "$input"
The & in the replacement pattern outputs the match itself.
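If you'd rather avoid sed entirely, a pure-bash sketch with BASH_REMATCH does the same append (assuming the printer name contains no spaces):
if [[ $input =~ printer:[^[:space:]]+ ]]; then
match=${BASH_REMATCH[0]}
new="$match=\"MS Publisher Imagesetter\""
input=${input/"$match"/$new}
fi
echo "$input"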

Bash interpreter changes argument order

I have a bash script and try to run a command inside it.
That's ok
echo ${something:="zip -r -q $TAG -P $PASS $LOCPATH"}
>zip -r -q evolution -P evolution ~/.gconf/apps/evolution
That's ok too
zip -r -q evolution -P evolution ~/.gconf/apps/evolution
But here the order gets changed, though only when the values are passed this way, and a strange . -i is added
zip -r -q $TAG -P $PASS $LOCPATH
>zip error: Nothing to do! (try: zip -r -q -P evolution evolution . -i ~/.gconf/apps/evolution)
Thanks for any advice.
BASH FAQ entry #50: "I'm trying to put a command in a variable, but the complex cases always fail!"
something=(zip -r -q "$TAG" -P "$PASS" "$LOCPATH")
"${something[#]}"
Try doing type zip; it seems zip is aliased to something.
Maybe use the full path of zip to override this, something like:
/usr/bin/zip
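To confirm and bypass an alias (aliases only expand the first word of a command, so prefixing with command, or writing \zip, skips the alias):
type zip
command zip -r -q "$TAG" -P "$PASS" "$LOCPATH"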
