GNU parallel read from several files - bash

I am trying to use GNU parallel to convert individual files with a bioinformatic tool called vcf2maf.
My command looks something like this:
${parallel} --link "perl ${vcf2maf} --input-vcf ${1} \
--output-maf ${maf_dir}/${2}.maf \
--tumor-id ${3} \
--tmp-dir ${vcf_dir} \
--vep-path ${vep_script} \
--vep-data ${vep_data} \
--ref-fasta ${fasta} \
--filter-vcf ${filter_vcf}" :::: ${VCF_files} ${results} ${tumor_ids}
VCF_files, results and tumor_ids contain one entry per line and correspond to one another.
When I try to run the command, I get the following error for every file:
ERROR: Both input-vcf and output-maf must be defined!
This confused me, because if I run the command manually, the program works as intended, so I don't think the input/output paths are wrong. To confirm this, I also ran
${parallel} --link "cat ${1}" :::: ${VCF_files} ${results} ${tumor_ids},
which correctly prints the contents of the VCF files, whose path is listed in VCF_files.
I am really confused about what I did wrong; if anyone could help me out, I'd be very thankful.
Thanks!

For a command this long I would normally define a function:
doit() {
...
}
export -f doit
Then test this on a single input.
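A filled-in version might look something like this (a sketch along the lines of the question's command; note that the helper variables must be exported too, since GNU Parallel runs the function in subshells, and the test arguments below are hypothetical):
doit() {
    # $1 = input VCF, $2 = output basename, $3 = tumor id
    perl "$vcf2maf" --input-vcf "$1" \
        --output-maf "$maf_dir/$2.maf" \
        --tumor-id "$3" \
        --tmp-dir "$vcf_dir" \
        --vep-path "$vep_script" \
        --vep-data "$vep_data" \
        --ref-fasta "$fasta" \
        --filter-vcf "$filter_vcf"
}
export -f doit
export vcf2maf maf_dir vcf_dir vep_script vep_data fasta filter_vcf
doit sample.vcf sample TUMOR01
Inside a function, $1, $2, and $3 are the function's own arguments, so they no longer collide with the calling script's positional parameters.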
When it works:
parallel --link doit :::: ${VCF_files} ${results} ${tumor_ids}
But if you want to use a single command it will look something like:
${parallel} --link "perl ${vcf2maf} --input-vcf {1} \
--output-maf ${maf_dir}/{2}.maf \
--tumor-id {3} \
--tmp-dir ${vcf_dir} \
--vep-path ${vep_script} \
--vep-data ${vep_data} \
--ref-fasta ${fasta} \
--filter-vcf ${filter_vcf}" :::: ${VCF_files} ${results} ${tumor_ids}
GNU Parallel's replacement strings are {1}, {2}, and {3} - not ${1}, ${2}, and ${3}.
--dryrun is your friend when GNU Parallel does not do what you expect it to do.
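A tiny illustration of --dryrun (toy data, not the question's files):
parallel --dryrun --link 'echo {1} {2}' ::: a b ::: 1 2
# prints the composed commands instead of running them:
# echo a 1
# echo b 2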

How to execute bash variable with double quotes and get the output in realtime

I have a variable that has a command that I want to run.
It has a bunch of double-quotes. When I echo it, it looks beautiful.
I can copy-paste it and run it just fine.
I tried simply $cmd, but it doesn't work. I get an error as if the command is malformed.
I then tried running it via eval "$cmd" or similarly, bash -c "$cmd", which works, but I don't get any output until the command is done running.
Example with bash -c "$cmd":
This runs the command, BUT I don't get any output until the command is done running, which sucks and I'm trying to fix that:
cmd="docker run -v \"$PROJECT_DIR\":\"$PROJECT_DIR\" \
-v \"$PPI_ROOT_DIR/utilities/build_deploy/terraform/modules/\":/ppi_modules \
--workdir \"$PROJECT_DIR/terraform\" \
--env TF_VAR_aws_account_id=$AWS_ACCOUNT_ID \
--env TF_VAR_environment=${ENVIRONMENT} \
--env TF_VAR_region=${AWS_DEFAULT_REGION:-us-west-2} \
${OPTIONAL_AWS_ENV_VARS} \
${CUSTOM_TF_VARS} \
${TERRAFORM_BASE_IMAGE} \
init --plugin-dir=/.terraform/providers \
-reconfigure \
-backend-config=\"bucket=${AWS_ACCOUNT_ID}-tf-remote-state\" \
-backend-config=\"key=${ENVIRONMENT}/${PROJECT_NAME}\" \
-backend-config=\"region=us-west-2\" \
-backend-config=\"dynamodb_table=terraform-locks\" \
-backend=true"
# command output looks good. I can copy and paste it and run it in my terminal too.
echo $cmd
# Running the command via bash works,
# but I don't get the output until the command is done running,
# which is what I'm trying to fix:
bash -c "$cmd"
Here is an example using a bash array.
It prints to the screen perfectly, but just like running $cmd, it throws an error as if the command is malformed:
cmd=(docker run -v \"$PROJECT_DIR\":\"$PROJECT_DIR\" \
-v \"$PPI_ROOT_DIR/utilities/build_deploy/terraform/modules/\":/ppi_modules \
--workdir \"$PROJECT_DIR/terraform\" \
--env TF_VAR_aws_account_id=$AWS_ACCOUNT_ID \
--env TF_VAR_environment=${ENVIRONMENT} \
--env TF_VAR_region=${AWS_DEFAULT_REGION:-us-west-2} \
${OPTIONAL_AWS_ENV_VARS} \
${CUSTOM_TF_VARS} \
${TERRAFORM_BASE_IMAGE} \
init --plugin-dir=/.terraform/providers \
-reconfigure \
-backend-config=\"bucket=${AWS_ACCOUNT_ID}-tf-remote-state\" \
-backend-config=\"key=${ENVIRONMENT}/${PROJECT_NAME}\" \
-backend-config=\"region=us-west-2\" \
-backend-config=\"dynamodb_table=terraform-locks\" \
-backend=true)
echo "${cmd[#]}"
"${cmd[#]}"
How can I execute a bash variable that has double-quotes, but run it so I get the output in realtime, just as if I executed it via $cmd (which doesn't work)?
Similar to these questions, but my question is to run it AND get the output in realtime:
Execute command containing quotes from shell variable
Bash inserting quotes into string before execution
bash script execute command with double quotes, single quotes and spaces
In your array version, the double quotes escaped by a backslash become literal parts of the arguments, which is not what you intend.
So removing the backslashes (letting the quotes act as shell syntax) should fix the issue, as sketched below.
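For instance, a trimmed sketch of the corrected array (several of the question's arguments are omitted for brevity):
cmd=(docker run -v "$PROJECT_DIR":"$PROJECT_DIR" \
    --workdir "$PROJECT_DIR/terraform" \
    --env TF_VAR_environment="${ENVIRONMENT}" \
    "${TERRAFORM_BASE_IMAGE}" \
    init -reconfigure \
    -backend-config="bucket=${AWS_ACCOUNT_ID}-tf-remote-state")
echo "${cmd[@]}"   # inspect the words
"${cmd[@]}"        # executes directly, so output streams in real time
Here the double quotes are shell syntax rather than data, so each expansion stays a single argument even if it contains spaces, and running the array directly avoids the extra bash -c layer entirely.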

Passing Variables in Makefile

I'm using a Makefile to run various docker-compose commands and I'm trying to capture the output of a script run on my local machine and pass that value to a Docker image.
start-service:
VERSION=$(shell aws s3 ls s3://redact/downloads/1.2.3/) && \
docker-compose -f ./compose/docker-compose.yml run \
-e VERSION=$$(VERSION) \
connect make run-service
When I run this I can see the variable being assigned but it still errors. Why is the value not getting passed into the -e argument:
VERSION=1.2.3-build342 && \
docker-compose -f ./compose/docker-compose.yml run --rm \
-e VERSION?=$(VERSION) \
connect make run-connect
/bin/sh: VERSION: command not found
You're mixing several different Bourne shell and Make syntaxes here. The Make $$(VERSION) translates to shell $(VERSION), which is command-substitution syntax; GNU Make $(shell ...) generally expands at the wrong time and isn't what you want here.
If you were writing this as an ordinary shell command it would look like
# Set VERSION using $(...) substitution syntax
# Refer to just plain $VERSION
VERSION=$(aws s3 ls s3://redact/downloads/1.2.3/) && ... \
-e VERSION=$VERSION ... \
So when you use this in a Make context, if none of the variables are Make variables (they get set and used in the same command), you just need to double the $ to $$ to escape them:
start-service:
VERSION=$$(aws s3 ls s3://redact/downloads/1.2.3/) && \
docker-compose -f ./compose/docker-compose.yml run \
-e VERSION=$$VERSION \
connect make run-service
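As a self-contained illustration of the doubling rule (a hypothetical target; remember that the recipe lines must start with a tab):
show-version:
	VERSION=$$(date +%Y%m%d) && \
	echo "the shell sees VERSION=$$VERSION"
Make rewrites each $$ to a single $ before handing the line to the shell, and the shell then performs the command substitution itself.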

psql return value / error killing the shell script that called it?

I'm running several psql commands inside a bash shell script. One of the commands imports a CSV file to a table. The problem is that the CSV file is occasionally corrupt: it has invalid characters at the end, and the import fails. When that happens, and I have the ON_ERROR_STOP=on flag set, my entire shell script stops at that point as well.
Here's the relevant bits of my bash script:
$(psql \
-X \
$POSTGRES_CONNECTION_STRING \
-w \
-b \
-L ./output.txt
-A \
-q \
--set ON_ERROR_STOP=on \
-t \
-c "\copy mytable(...) from '$input_file' csv HEADER"\
)
echo "import is done"
The above works fine as long as the CSV file isn't corrupt. If it is, however, psql spits out a message to the console that begins ERROR: invalid byte sequence for encoding "UTF8": 0xb1 and my bash script apparently stops cold at that point: my echo statement above doesn't execute, and neither do any subsequent commands.
Per the psql documentation, a hard stop in psql should return an error code of 3:
psql returns 0 to the shell if it finished normally, 1 if a fatal error of its own occurs (e.g. out of memory, file not found), 2 if the connection to the server went bad and the session was not interactive, and 3 if an error occurred in a script and the variable ON_ERROR_STOP was set
That's fine and good, but is there a reason returning a value of 3 should terminate my calling bash script? And can I prevent that? I'd like to keep ON_ERROR_STOP set to on because I actually have other commands I'd like to run in that psql statement if the initial import succeeds, but not if it doesn't.
ON_ERROR_STOP will not work with the -c option.
Also, the $(...) surrounding the psql call looks wrong: do you want to execute its output as a command?
Finally, you forgot a backslash after the -L option.
Try using a “here document”:
psql \
-X \
$POSTGRES_CONNECTION_STRING \
-w \
-b \
-L ./output.txt \
-A \
-q \
--set ON_ERROR_STOP=on \
-t <<EOF
\copy mytable(...) from '$input_file' csv HEADER
EOF
echo "import is done"

Unescape the ampersand (&) via XMLStarlet - Bugging &amp;

This is a quite annoying but rather simple task. According to this guide, I wrote this:
#!/bin/bash
content=$(wget "https://example.com/" -O -)
ampersand=$(echo '\&')
xmllint --html --xpath '//*[@id="table"]/tbody' - <<<"$content" 2>/dev/null |
xmlstarlet sel -t \
-m "/tbody/tr/td" \
-o "https://example.com" \
-v "a//#href" \
-o "/?A=1" \
-o "$ampersand" \
-o "B=2" -n \
I successfully extract each link from the table and everything gets concatenated correctly; however, instead of reproducing the ampersand as & I receive this at the end of each link:
https://example.com/hello-world/?A=1\&amp;B=2
But actually, I was looking for something like:
https://example.com/hello-world/?A=1&B=2
The idea was to escape the character using a backslash \& so that it gets ignored. Initially, I tried placing it directly into -o "\&" \ instead of -o "$ampersand" \ and removing ampersand=$(echo '\&'). Still the same result.
Essentially, by removing the backslash it still outputs:
https://example.com/hello-world/?A=1&amp;B=2
Only the \ in front of the &amp; is gone.
Why?
I'm sure it is something basic that is missing.
&amp; is the correct way to print & in an XML document, but since you just want a plain URL, your output should not be XML. Therefore you need to switch to text mode, by passing --text or -T to the sel command.
Your example input doesn't quite work because example.com doesn't have any table elements, but here is a working example building links from p elements instead.
content=$(wget 'https://example.com/' -O -)
xmlstarlet fo --html <<<"$content" |
xmlstarlet sel -T -t \
-m '//p[a]' \
--if 'not(starts-with(a//@href,"http"))' \
-o 'https://example.com/' \
--break \
-v 'a//@href' \
-o '/?A=1' \
-o '&' \
-o 'B=2' -n
The output is
http://www.iana.org/domains/example/?A=1&B=2
As you have already seen, backslash-escaping isn't the solution here. I can think of two possible options:
Extract the hrefs (you probably don't need both xmllint and xmlstarlet for that), then use a standard text-processing tool such as sed to add the start and the end:
sed 's,^,https://example.com/,; s,$,/?A=1\&B=2,'
Alternatively, pipe the output of what you've currently got to xmlstarlet unesc, which will change &amp; into &.
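For instance (assuming your xmlstarlet build provides the unesc command):
printf '%s\n' 'https://example.com/hello-world/?A=1&amp;B=2' | xmlstarlet unesc
# prints: https://example.com/hello-world/?A=1&B=2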
Sorry, I can't reproduce your result, but why not make substitutions? Just filter your results through
sed 's/\\&/\&/g'
added to your pipe. It should replace every \& with &.

Grouping commands inside complex bash expression

I have access to a computing cluster (LSF) and the basic way to send stuff to the compute nodes is by doing:
bsub -I <command>
I had this in a file:
bsub -I ../configure --prefix="..." \
--solver=...\
--with-cflags=...\
&& make -j8 \
&& make install
However, I just noticed that only the first command (configure) was actually running on the cluster; the remaining two were running locally. What's the best way to group the whole command and pass it to bsub?
Assuming the bsub you are referring to is the one documented here, you have two options:
Surround the entire command to be executed with single quotes (assuming you don't use a single quote anywhere in the command):
bsub -I '../configure --prefix="..."\
--solver=...\
--with-cflags=...\
&& make -j8 \
&& make install'
Feed the command to bsub's standard input, using a HERE document to avoid quoting issues:
bsub -I <<END
../configure --prefix="..." \
--solver=...\
--with-cflags=...\
&& make -j8 \
&& make install
END
Or, very similar to the second one, put the command into a file and provide the file as input.
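For instance (a sketch; the script name is hypothetical):
cat > job.sh <<'END'
../configure --prefix="..." && make -j8 && make install
END
bsub -I < job.sh
A further variant wraps the command in an explicit shell invocation: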
bsub -I sh -c '../configure --prefix="..." \
--solver=...\
--with-cflags=...\
&& make -j8 \
&& make install'
