What are the parameters used for in this shell code?

hadoop jar cc-jar-with-dependencies.jar com.coupang.pz.cc.merge.Merge_Run \
${IDF_OUT} \
${IG_OUT} \
${PROB_OUT} \
${MERGE_OUT} \
1.00 \
0.000001 \
0.0001 \
This is a piece of shell code, and I know that hadoop will run cc-jar-with-dependencies.jar on HDFS. But what do the other parameters, from the second line onward, mean? Are they the parameters needed by the jar package?
Each ${...}, such as ${IDF_OUT}, is a path on HDFS.

The usage of ${WORD} is the basic case of parameter expansion in bash:
$PARAMETER
${PARAMETER}
The easiest form is to just use a parameter's name within braces. This is identical to using $FOO like you see it everywhere, but has the advantage that it can be immediately followed by characters that would be interpreted as part of the parameter name otherwise.
For example,
word="car"
echo "The plural of $word is most likely $words"
echo "The plural of $word is most likely ${word}s"
produces the output:
The plural of car is most likely
The plural of car is most likely cars
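Notice that the first line does not contain "cars" as expected: the shell looked up a variable named words (which is unset) instead of word followed by an s. Only the ${word}s form tells the shell where the variable name ends.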
Coming back to your example,
hadoop jar cc-jar-with-dependencies.jar com.coupang.pz.cc.merge.Merge_Run \
${IDF_OUT} \
${IG_OUT} \
${PROB_OUT} \
${MERGE_OUT} \
1.00 \
0.000001 \
0.0001 \
From the second line onward, ${IDF_OUT}, ${IG_OUT}, ${PROB_OUT} and ${MERGE_OUT} are in all likelihood shell variables (possibly environment variables set elsewhere, holding paths on the Hadoop file system) that are expanded to their values when the command is run.
While I have explained what the ${WORD} syntax is, the actual purpose of these variables is decided by the program being run, not by the shell.

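Those parameters are passed to the hadoop command, so you would need to read the documentation for that command. Going by the general usage of hadoop jar (hadoop jar <jar> [mainClass] args...), the arguments after the class name are normally handed on to that class, so their meaning is most likely defined by com.coupang.pz.cc.merge.Merge_Run itself.
However, it might be interesting for you to find out the values contained in these parameters when your script is run. You can do that by modifying the code as shown below: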
echo >&2 \
hadoop jar cc-jar-with-dependencies.jar com.coupang.pz.cc.merge.Merge_Run \
${IDF_OUT} \
${IG_OUT} \
${PROB_OUT} \
${MERGE_OUT} \
1.00 \
0.000001 \
0.0001 \
This change causes the whole command to be printed rather than executed; the >&2 redirects echo's standard output to standard error (which may help get the text onto the terminal if some output capture is going on). Note that this change is for debugging or curiosity only: with it in place, your script no longer executes the command.
If you know the values, the whole command is likely to be easier to make sense of.
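As a variation on the echo approach, the following sketch prints each argument on its own line, which makes it easier to see where one argument ends and the next begins. The four path values are hypothetical stand-ins; the real ones are set elsewhere in your script.

```shell
# Hypothetical HDFS paths, for illustration only.
IDF_OUT=/user/demo/idf
IG_OUT=/user/demo/ig
PROB_OUT=/user/demo/prob
MERGE_OUT=/user/demo/merge

# Load the command line into the positional parameters instead of running it.
set -- hadoop jar cc-jar-with-dependencies.jar com.coupang.pz.cc.merge.Merge_Run \
    "${IDF_OUT}" "${IG_OUT}" "${PROB_OUT}" "${MERGE_OUT}" \
    1.00 0.000001 0.0001

# Print one argument per line, bracketed so stray whitespace is visible.
printf '[%s]\n' "$@"
arg_count=$#
echo "number of arguments: $arg_count"
```

Everything from the fifth line of output onward is what the Java class receives as its arguments.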

Related

BWA-mem and sambamba read group line error

This is a two-part question:
help interpreting an error;
help with coding.
I'm trying to run bwa-mem and sambamba to align raw reads to a reference genome and to sort them by position. These are the commands I'm using:
bwa mem \
-K 100000000 -v 3 -t 6 -Y \
-R '\@S200031047L1C001R002\S*[1-2]' \
/path/to/reference/GCF_009858895.2_ASM985889v3_genomic.fna \
/path/to/raw-fastq/'S\S[^_]*_L01_[0-9]+-[0-9]+'_1.fq.gz \
/path/to/raw-fastq/'S\S[^_]*_L01_[0-9]+-[0-9]+'_2.fq.gz | \
/path/to/genomics/sambamba-0.8.2 view -S -f bam \
/dev/stdin | \
/path/to/genomics/sambamba-0.8.2 sort \
/dev/stdin \
--out host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam
This is the error message I'm getting: [E::bwa_set_rg] the read group line is not started with @RG.
My sequences were generated with an MGI sequencer and the read groups are identified like this: @S200031047L1C001R0020000243/1, i.e., they don't begin with @RG. How can I specify to sambamba that my read groups start with @S and not @RG?
The commands above come from a published pipeline that I'm modifying for my own research. However, among several changes, I'm not confident about how to define the sample ID as used in the last line of the code: --out host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam (I'm referring to ${SAMPLE}). Any insights?
Thank you very much!
1. Specifying read groups
Your read group string is not correctly formatted. It should look like
'@RG\tID:$ID\tSM:$SM\tLB:$LB\tPU:$PU\tPL:$PL' where the parts beginning with a $ sign should be replaced with the information specific to your sequencing run and sample. Not all of them are required for all purposes. See this read group documentation by GATK team for an example.
A read group specification always begins with @RG. That's part of the SAM format. Sequencers do not produce read groups. I think you may be confusing them with fastq header lines. Entries in the read group string are separated by tabs, denoted with \t. Tags and their values are separated by :.
The difference between $ID (read group id) and $SM (sample id) is that sample is the individual or biological sample which may have been sequenced several times in different libraries ($LB). In the GATK documentation they combine flowcell and library into the read group id. Sample and library could make an intuitive read group id in small projects. If you are working on your own project that is not part of a larger sequencing effort, you can define the read groups as you like. If several people work in the same project, you should be consistent to avoid problems later.
2. Variable substitution
I'm not sure if I understood you correctly, but if you are wondering what ${SAMPLE} means in the command, it's a variable called SAMPLE that will be replaced by its value when the command is run. The curly brackets protect the name so that the shell does not confuse the variable name with characters coming after it. See here for examples.
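To tie both parts together, here is a minimal sketch of defining SAMPLE once and reusing it for the read group string and the output path. The sample name and the LB and PL values are placeholders I made up, not values from your run; replace them with whatever describes your data.

```shell
# Hypothetical sample name; in a real pipeline this might come from a loop
# over input files or from a command-line argument.
SAMPLE=sample01

# Read group line for bwa mem -R. The \t sequences stay literal in the shell;
# bwa converts them to tabs itself. LB and PL are placeholder values.
RG="@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tLB:lib1\tPL:DNBSEQ"

# Output path built from the same variable, as in the last line of the pipeline.
OUT="host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam"

# Print both values instead of running the aligner.
printf '%s\n' "$RG" "$OUT"
```

In the pipeline itself you would then pass -R "$RG" to bwa mem and --out "$OUT" to sambamba sort.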

What do two at (@) signs surrounding a string mean in a shell script?

For example,
# Execute the pre-hook.
export SHELL=@shell@
param1=@param1@
param2=@param2@
param3=@param3@
param4=@param4@
param5=@param5@
if test -n "@preHook@"; then
. @preHook@
fi
For context, this is from a shell script in a commit from 2004 in the Nixpkgs repo; I tried to see whether this might be some kind of reference feature, but the string "shell" occurs only once (in a case-sensitive search) in the entire file.
The answer by Chris Dodd is correct, insofar as there's no intrinsic meaning to the shell -- and @foo@ is thus commonly used as a sigil. Insofar as you encountered this in nixpkgs, it provides some stdenv tools specifically for implementing this pattern.
As documented at https://nixos.org/manual/nixpkgs/stable/#ssec-stdenv-functions, nixpkgs stdenv provides shell functions including substitute, substituteAll, substituteInPlace &c. which will replace @foo@ values with the content of corresponding variables.
In the context of the linked commit, substitutions of that form can be seen being performed in pkgs/build-wrapper/gcc-wrapper/builder.sh:
sed \
-e "s^@gcc@^$src^g" \
-e "s^@out@^$out^g" \
-e "s^@bash@^$SHELL^g" \
-e "s^@shell@^$shell^g" \
< $gccWrapper > $dst
...is replacing @out@ with the value of $out, @bash@ with the value of $SHELL, etc.
The @ symbol has no meaning to the shell -- it is a punctuation character that will pretty much never occur in any actual shell script.
This makes it a good choice to use for patterns in script templates -- the basic idea being that a simple search-and-replace process will be used (perhaps with a sed script as in the link you show) to rewrite the template into an actual shell script. Every string of the form @name@ in the template will be replaced by some other string related to the environment in which the script is being installed.
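A small, self-contained sketch of that search-and-replace step follows. The template text is a shortened version of the snippet in the question, and the values of shell and preHook are made up for illustration; this is not the actual nixpkgs builder.

```shell
# Hypothetical replacement values.
shell=/bin/sh
preHook=/tmp/pre-hook.sh

# A tiny template with placeholders wrapped in at signs.
template='export SHELL=@shell@
if test -n "@preHook@"; then
. @preHook@
fi'

# Rewrite every placeholder, using ^ as the sed delimiter like the builder does,
# so that the slashes in the paths need no escaping.
rendered=$(printf '%s\n' "$template" |
    sed -e "s^@shell@^$shell^g" -e "s^@preHook@^$preHook^g")

printf '%s\n' "$rendered"
```

The result is an ordinary shell script with no placeholders left in it.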

How to input endlessly by shell script

I am trying to use a shell script to generate data for my Kafka topic.
First, I wrote a shell script, run_producer.sh:
#!/bin/sh
./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic AAATest2 \
--property "parse.key=true" \
--property "key.separator=:" \
--property key.schema='{"type":"string"}' \
--property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"measurement","type":"string"},{"name":"id","type":"int"}]}'
It requires you to type strings like "key1":{"measurement": "INFO", "id": 1} on the command line while run_producer.sh is running, and you can enter as many as you want.
I wrote another script, add_data.sh:
#!/bin/sh
s="\"key1\":{\"measurement\": \"INFO\", \"id\": 1}"
printf "${s}\n${s}\n" | ./run_producer.sh
It can input the string twice, or more often by adding "${s}\n" to the printf, but that is limited and clumsy.
I want it to input the string endlessly until I stop it. How can I do that with a shell script?
I would also be very grateful if you could tell me how to make the string differently (different data), by the way.
You could use yes "$s" to produce endless input for your script. But what do you mean by "make the string differently"? Would an infinite loop with some random data be enough, like
while true; do s="\"key1\":{\"measurement\": \"INFO\", \"id\": $RANDOM}"; echo "$s"; done
or do you need to modify it in some other way?
man bash:
RANDOM Each time this parameter is referenced, a random integer between 0 and 32767 is generated. The sequence of random numbers may be initialized by assigning a value to RANDOM. If RANDOM is unset, it loses its special properties, even if it is subsequently reset.
You could combine it with anything else, like: "key_${RANDOM}"
Or choose any other method like https://gist.github.com/earthgecko/3089509 or https://unix.stackexchange.com/questions/230673/how-to-generate-a-random-string
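If predictable values are preferable to random ones, a counter works as well and does not depend on bash's RANDOM. The sketch below uses a hypothetical helper, gen_records, that emits records in the format from the question; the optional limit exists only so the example terminates when run on its own.

```shell
# gen_records [N]: print N records, or run forever when N is omitted.
gen_records() {
    limit=${1:-}
    i=0
    while [ -z "$limit" ] || [ "$i" -lt "$limit" ]; do
        i=$((i + 1))
        # Both the key and the id change with every record.
        printf '"key%d":{"measurement": "INFO", "id": %d}\n' "$i" "$i"
    done
}

# Bounded demonstration: print three records.
gen_records 3
```

For endless input, call it without an argument and pipe it into the producer, as in gen_records | ./run_producer.sh, then stop it with Ctrl-C.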

Why do bash parameter expansions cause an rsync command to operate differently?

I am attempting to run an rsync command that will copy files to a new location. If I run the rsync command directly, without any parameter expansions on the command line, rsync does what I expect
$ rsync -amnv --include='lib/***' --include='arm-none-eabi/include/***' \
--include='arm-none-eabi/lib/***' --include='*/' --exclude='*' \
/tmp/from/ /tmp/to/
building file list ... done
created directory /tmp/to
./
arm-none-eabi/
arm-none-eabi/include/
arm-none-eabi/include/_ansi.h
...
arm-none-eabi/lib/
arm-none-eabi/lib/aprofile-validation.specs
arm-none-eabi/lib/aprofile-ve.specs
...
lib/
lib/gcc/
lib/gcc/arm-none-eabi/
lib/gcc/arm-none-eabi/4.9.2/
lib/gcc/arm-none-eabi/4.9.2/crtbegin.o
...
sent 49421 bytes received 6363 bytes 10142.55 bytes/sec
total size is 423195472 speedup is 7586.32 (DRY RUN)
However, if I enclose the filter arguments in a variable, and invoke the command using that variable, different results are observed. rsync copies over a number of extra directories I do not expect:
$ FILTER="--include='lib/***' --include='arm-none-eabi/include/***' \
--include='arm-none-eabi/lib/***' --include='*/' --exclude='*'"
$ rsync -amnv ${FILTER} /tmp/from/ /tmp/to/
building file list ... done
created directory /tmp/to
./
arm-none-eabi/
arm-none-eabi/bin/
arm-none-eabi/bin/ar
...
arm-none-eabi/include/
arm-none-eabi/include/_ansi.h
arm-none-eabi/include/_syslist.h
...
arm-none-eabi/lib/
arm-none-eabi/lib/aprofile-validation.specs
arm-none-eabi/lib/aprofile-ve.specs
...
bin/
bin/arm-none-eabi-addr2line
bin/arm-none-eabi-ar
...
lib/
lib/gcc/
lib/gcc/arm-none-eabi/
lib/gcc/arm-none-eabi/4.9.2/
lib/gcc/arm-none-eabi/4.9.2/crtbegin.o
...
sent 52471 bytes received 6843 bytes 16946.86 bytes/sec
total size is 832859156 speedup is 14041.53 (DRY RUN)
If I echo the command that fails, it generates the exact command that succeeds. Copying the output, and running directly gives me the expected result.
There is obviously something I'm missing about how bash parameter expansion works. Can somebody please explain why the two different invocations produce different results?
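The shell parses quotes before expanding variables, so putting quotes in a variable's value doesn't do what you expect -- by the time they're in place, it's too late for them to do anything useful. Here, rsync receives patterns such as 'lib/***' with the single quotes still attached as literal characters, so the filters no longer match the intended paths. See BashFAQ #50: I'm trying to put a command in a variable, but the complex cases always fail! for more details.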
In your case, it looks like the easiest way around this problem is to use an array rather than a plain text variable. This way, the quotes get parsed when the array is created, each "word" gets stored as a separate array element, and if you reference the variable properly (with double-quotes and [@]), the array elements get included in the command's argument list without any unwanted parsing:
filter=(--include='lib/***' --include='arm-none-eabi/include/***' \
--include='arm-none-eabi/lib/***' --include='*/' --exclude='*')
rsync -amnv "${filter[@]}" /tmp/from/ /tmp/to/
Note that arrays are available in bash and zsh, but not all other POSIX-compatible shells. Also, I lowercased the filter variable name -- recommended practice to avoid colliding with the shell's special variables (which are all uppercase).
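The difference can be made visible without running rsync at all. This sketch (bash is required for the array) prints the first argument that a command would receive in each case; show_args is a hypothetical helper, and the filter list is shortened to two entries.

```shell
# Print each received argument in brackets, one per line.
show_args() { printf '[%s]\n' "$@"; }

# String variant: the single quotes are ordinary characters inside the value.
# Globbing is switched off so the unquoted expansion is only word-split.
set -f
FILTER="--include='lib/***' --exclude='*'"
string_first=$(show_args $FILTER | head -n 1)
set +f

# Array variant: the quotes are parsed and removed when the array is built.
filter=(--include='lib/***' --exclude='*')
array_first=$(show_args "${filter[@]}" | head -n 1)

echo "from string: $string_first"
echo "from array:  $array_first"
```

The string variant hands over --include='lib/***' with the quote characters included, while the array variant hands over --include=lib/***, which is what typing the command directly produces.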
I like to break the arguments onto separate lines, for convenience sake:
ROPTIONS=(
-aNHXxEh
--delete
--fileflags
--exclude-from="$EXCLUDELIST"
--delete-excluded
--force-change
--stats
--protect-args
)
and then call it thusly:
rsync "${ROPTIONS[@]}" "$SOURCE" "$DESTINATION"

Bash imported variables break configure script when double-quoted

I have a bash script which imports some variables from another static file, which itself uses environment variables set by another file when the script is invoked.
This is the file that gets imported and sets some variables.
# package.mk
PKG_NAME="binutils"
PKG_VERSION="2.24"
PKG_URL_TYPE="http"
PKG_URL="http://ftp.gnu.org/gnu/binutils/${PKG_NAME}-${PKG_VERSION}.tar.bz2"
PKG_DEPENDS=""
PKG_SECTION="devel"
PKG_CONFIGURE_OPTS="--prefix=${TOOLS} \
--target=${TARGET} \
--with-sysroot=${TOOLS}/${TARGET} \
--disable-nls \
--disable-multilib"
It is used by the build script like so:
#!/bin/bash
# Binutils
. settings/config
pkg_dir="$(locate_package 'binutils')"
. "${pkg_dir}/package.mk"
# etc...
"${CLFS_SOURCES}/${PKG_NAME}-${PKG_VERSION}/configure" "${PKG_CONFIGURE_OPTS}"
# etc...
This script first imports the settings/config file which has a bunch of global variables used by this script and others, and exports them so they are available as environment variables. It then locates the correct package.mk file for the specific component we are building, and imports it as well. So far, so good.
However, when I double-quote the options (PKG_CONFIGURE_OPTS) for the configure script:
"${CLFS_SOURCES}/${PKG_NAME}-${PKG_VERSION}/configure" "${PKG_CONFIGURE_OPTS}"
I get the following error:
gcc: error: unrecognized command line option ‘--with-sysroot=/root/LiLi/target/cross-tools/arm-linux-musleabihf’
If I leave it unquoted, like:
"${CLFS_SOURCES}/${PKG_NAME}-${PKG_VERSION}/configure" ${PKG_CONFIGURE_OPTS}
it works fine (--with-sysroot= is indeed a valid configure flag for binutils).
Why is this? What can I change so that I can double-quote that portion (going by the bash wisdom that one should double-quote just about everything).
Quoting the variable means the entire thing is passed as a single argument, spaces and newlines included. You want word splitting to be performed so that the string is treated as multiple arguments. That's why leaving it unquoted works.
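A quick way to see this is to count the arguments in each case. In the sketch below, count_args is a hypothetical helper, and the TOOLS and TARGET values are made up; the option string is a shortened version of the one in package.mk.

```shell
# Report how many arguments were received.
count_args() { echo "$#"; }

# Hypothetical values for illustration.
TOOLS=/opt/cross-tools
TARGET=arm-linux-musleabihf

PKG_CONFIGURE_OPTS="--prefix=${TOOLS} \
--target=${TARGET} \
--disable-nls"

# Quoted: the whole string arrives as one argument.
quoted_count=$(count_args "${PKG_CONFIGURE_OPTS}")
# Unquoted: the shell splits the string on whitespace first.
unquoted_count=$(count_args ${PKG_CONFIGURE_OPTS})

echo "quoted: $quoted_count, unquoted: $unquoted_count"
```

This prints quoted: 1, unquoted: 3, so in the quoted case configure sees a single long option instead of three separate ones.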
If you're looking for the "right" way to handle this, I recommend using an array. An array can hold multiple values while also preserving whitespace properly.
PKG_CONFIGURE_OPTS=(--prefix="$TOOLS"
--target="$TARGET"
--with-sysroot="$TOOLS/$TARGET"
--disable-nls
--disable-multilib)
...
"$CLFS_SOURCES/$PKG_NAME-$PKG_VERSION/configure" "${PKG_CONFIGURE_OPTS[@]}"

Resources