BWA-mem and sambamba read group line error


This is a two-part question:
help interpreting an error;
help with coding.
I'm trying to run bwa-mem and sambamba to align raw reads to a reference genome and to sort them by position. These are the commands I'm using:
bwa mem \
-K 100000000 -v 3 -t 6 -Y \
-R '\@S200031047L1C001R002\S*[1-2]' \
/path/to/reference/GCF_009858895.2_ASM985889v3_genomic.fna \
/path/to/raw-fastq/'S\S[^_]*_L01_[0-9]+-[0-9]+'_1.fq.gz \
/path/to/raw-fastq/'S\S[^_]*_L01_[0-9]+-[0-9]+'_2.fq.gz | \
/path/to/genomics/sambamba-0.8.2 view -S -f bam \
/dev/stdin | \
/path/to/genomics/sambamba-0.8.2 sort \
/dev/stdin \
--out host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam
This is the error message I'm getting: [E::bwa_set_rg] the read group line is not started with @RG.
My sequences were generated with an MGI sequencer and the read groups are identified like this @S200031047L1C001R0020000243/1, i.e., they don't begin with @RG. How can I specify to sambamba that my read groups start with @S and not @RG?
The commands above come from a published pipeline that I'm modifying for my own research. However, among several changes, I'm not sure how to define the sample ID used in the last line of the code: --out host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam (I'm referring to ${SAMPLE}). Any insights?
Thank you very much!

1. Specifying read groups
Your read group string is not correctly formatted. It should look like
'@RG\tID:$ID\tSM:$SM\tLB:$LB\tPU:$PU\tPL:$PL' where the parts beginning with a $ sign should be replaced with the information specific to your sequencing run and sample. Not all of them are required for all purposes. See this read group documentation by GATK team for an example.
A read group specification always begins with @RG. That's part of the SAM format. Sequencers do not produce read groups. I think you may be confusing them with fastq header lines (note also that the error comes from bwa's -R option, not from sambamba). Entries in the read group string are separated by tabs, denoted with \t. Tags and their values are separated by :.
The difference between $ID (read group id) and $SM (sample id) is that sample is the individual or biological sample which may have been sequenced several times in different libraries ($LB). In the GATK documentation they combine flowcell and library into the read group id. Sample and library could make an intuitive read group id in small projects. If you are working on your own project that is not part of a larger sequencing effort, you can define the read groups as you like. If several people work in the same project, you should be consistent to avoid problems later.
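As a rough illustration of the format described above, the read group string can be assembled from shell variables before it is handed to bwa mem. This is only a sketch: every value below (sample name, library name, the flowcell and lane presumably encoded in the fastq header, and the DNBSEQ platform tag that the SAM specification lists for MGI/BGI instruments) is a placeholder assumption to be replaced with your own details.

```shell
#!/bin/bash
# All values are hypothetical placeholders; substitute your own.
SAMPLE="sampleA"        # SM: the biological sample
LIBRARY="lib1"          # LB: the library prepared from that sample
FLOWCELL="S200031047"   # assumed to be the flowcell part of the fastq header
LANE="L1"               # assumed lane
PLATFORM="DNBSEQ"       # PL: platform tag for MGI/BGI instruments

# The \t sequences stay as literal backslash-t here; bwa mem turns them into tabs.
RG="@RG\tID:${FLOWCELL}.${LANE}\tSM:${SAMPLE}\tLB:${LIBRARY}\tPU:${FLOWCELL}.${LANE}.${SAMPLE}\tPL:${PLATFORM}"

# Inspect the string before using it.
printf '%s\n' "$RG"

# It would then be passed to bwa like this (not run here; file names are placeholders):
# bwa mem -R "$RG" reference.fna reads_1.fq.gz reads_2.fq.gz
```

Double quotes are used so that the variables expand; with single quotes, as in the template above, $ID and friends would be passed to bwa literally.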
2. Variable substitution
I'm not sure if I understood you correctly, but if you are wondering what ${SAMPLE} means in the command, it's a variable called SAMPLE that will be replaced by its value when the command is run. The curly brackets protect the name so that the shell does not confuse the variable name with characters coming after it. See here for examples.
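If the question is rather how to give SAMPLE a value, one possible approach is to derive it from the fastq file name with parameter expansion. The sketch below assumes a hypothetical file name that follows the pattern in the question and ends in _1.fq.gz; adjust the suffix to your own naming scheme.

```shell
#!/bin/bash
# Hypothetical input file; replace with a real path.
fq1="/path/to/raw-fastq/S200031047_L01_1-2_1.fq.gz"

# Strip the directory part, then the read-1 suffix, to get a sample id.
base="${fq1##*/}"           # S200031047_L01_1-2_1.fq.gz
SAMPLE="${base%_1.fq.gz}"   # S200031047_L01_1-2

# The braces keep the variable name separate from the text that follows it.
out="host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam"
echo "$out"
```

The same SAMPLE value could also be reused as the SM field of the read group.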

Related

Sending script and file content via STDIN

I generate (dynamically) a script concatenating the following files:
testscript1:
echo Writing File
cat > /tmp/test_file <<EOF
testcontent:
line1
second line
testscript2:
EOF
echo File is written
And I execute by calling
$ cat testscript1 testcontent testscript2 | ssh remote_host bash -s --
The effect is that the file /tmp/test_file is filled with the desired content.
Is there also a conceivable variant where binary files can be supplied in a similar fashion? Instead of cat, dd or other tools could of course be used, but the problem I see is 'telling' them that STDIN has now ended (can I send ^D through that stream?)
I am not able to get my head around that problem, but there is likely no comparable solution. However, I might be wrong, so I'd be happy to hear from you.
Regards,
Mazze

can I send ^D through that stream
Yes but you don't want to.
Control+D, commonly notated ^D, is just a character -- or to be pedantic (as I often am), a codepoint in the usual character code (ASCII or a superset like UTF-8) that we treat as a character. You can send that character/byte by a number of methods, most simply printf '\004', but the receiving system won't treat it as end-of-file; it will instead be stored in the destination file, just like any other data byte, followed by the subsequent data that you meant to be a new command and file etc.
^D only causes end-of-file when input from a terminal (more exactly, a 'tty' device) -- and then only in 'cooked' mode (which is why programs like vi and less can do things very different from ending a file when you type ^D). The form of ssh you used doesn't make the input a 'tty' device. ssh can make the input (and output) a 'tty' (more exactly a subclass of 'tty' called a pseudo-tty or 'pty', but that doesn't matter here) if you add the -t option (in some situations you may need to repeat it as -t -t or -tt). But then if your binary file contains any byte with the value \004 -- or several other special values -- which is quite possible, then your data will be corrupted and garbage commands executed (sometimes), which definitely won't do what you want and may damage your system.
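A quick way to convince yourself of this is to push a ^D byte through an ordinary pipe and count what arrives. This is just a local sketch using printf and wc, not the ssh setup from the question:

```shell
#!/bin/bash
# Byte 0x04 (^D) sent through a pipe is ordinary data, not end-of-file.
byte_count=$(printf 'before\004after\n' | wc -c | tr -d ' ')

# All 13 bytes arrive: 6 for "before", the 0x04 byte, 5 for "after", and the newline.
echo "$byte_count"
```

The reader sees end-of-file only when the writing side closes the pipe.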
The traditional approach to what you are trying to do, back in the 1980s and 1990s, was 'shar' (shell archive) and the usual solution to handling binary data was 'uuencode', which converts binary data into only printable characters that could safely go through a link like this, matched by 'uudecode' which converts it back. See this surviving example from GNU. uuencode and uudecode themselves were part of a communication protocol 'uucp' used mostly for email and Usenet, which are (all) mostly obsolete and forgotten.
However, nearly all systems today contain a 'base64' program which provides equivalent (though not identical) functionality. Within a single system you can do:
base64 <infile | base64 -d >outfile
to get the same effect as cp infile outfile. In your case you can do something like:
{ echo "base64 -d <<END# >outfile"; base64 <infile; echo "END#"; otherstuff; } | ssh remote bash
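The idea can be tried locally before involving ssh. In this sketch a plain local bash stands in for "ssh remote bash", the file names live in a temporary directory, and the END_B64 delimiter is an arbitrary choice; it also assumes a base64 that accepts -d (as GNU coreutils does).

```shell
#!/bin/bash
workdir=$(mktemp -d)

# A small "binary" file containing ^D (0x04), NUL and 0xFF bytes.
printf 'head\004\000\377tail\n' > "$workdir/infile"

# Build a script on the fly: a here-document carrying the base64 text,
# followed by further commands, and feed it to a shell on stdin.
{
    echo "base64 -d <<'END_B64' > '$workdir/outfile'"
    base64 < "$workdir/infile"
    echo "END_B64"
    echo "echo File is written"
} | bash    # for the remote case: | ssh remote_host bash -s --

# Check that the copy is byte-for-byte identical.
cmp "$workdir/infile" "$workdir/outfile" && echo "round trip OK"
```

Because base64 output consists only of printable characters, the delimiter line cannot appear inside the encoded data.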

You can also try:
cat testscript1 testcontent testscript2 | base64 | ssh <options> "base64 --decode | bash"
Don't worry about ^D, because when your input is exhausted, the next processes of the pipeline will notice that they have reached the end of the input file.

What are the parameters' usage here in the shell code?

hadoop jar cc-jar-with-dependencies.jar com.coupang.pz.cc.merge.Merge_Run \
    ${IDF_OUT} \
    ${IG_OUT} \
    ${PROB_OUT} \
    ${MERGE_OUT} \
    1.00 \
    0.000001 \
    0.0001 \
There is a piece of shell code and I know the hadoop will run the cc-jar-with-dependencies.jar on hdfs. But what are the meaning of the other parameters below from the second line. Are they the parameters needed for the jar package ?
${...} is the path on hdfs, like ${IDF_OUT} and so on.

${WORD} is the basic form of Parameter Expansion in bash and other shells:
$PARAMETER
${PARAMETER}
The easiest form is to just use a parameter's name within braces. This is identical to using $FOO like you see it everywhere, but has the advantage that it can be immediately followed by characters that would be interpreted as part of the parameter name otherwise.
For example,
word="car"
echo "The plural of $word is most likely $words"
echo "The plural of $word is most likely ${word}s"
produces the output
The plural of car is most likely
The plural of car is most likely cars
The first line does not contain cars as expected, because the shell read $words as a variable named words (which is unset and expands to nothing) rather than ${word} followed by s.
Coming back to your example,
hadoop jar cc-jar-with-dependencies.jar com.coupang.pz.cc.merge.Merge_Run \
    ${IDF_OUT} \
    ${IG_OUT} \
    ${PROB_OUT} \
    ${MERGE_OUT} \
    1.00 \
    0.000001 \
    0.0001 \
From the second line onwards, ${IDF_OUT}, ${IG_OUT}, ${PROB_OUT} and ${MERGE_OUT} are in all likelihood shell variables (possibly environment variables holding paths in the Hadoop file system) that are expanded to their values when the command is run.
While I have explained what the ${WORD} syntax means, the actual purpose of these variables is not something the shell itself can tell you.

Those parameters are passed to the hadoop command, so you would need to read the documentation for that command.
However, it might be interesting for you to find out the values contained in these parameters when your script is run. You can do that by modifying the code as shown below:
echo >&2 \
hadoop jar cc-jar-with-dependencies.jar com.coupang.pz.cc.merge.Merge_Run \
    ${IDF_OUT} \
    ${IG_OUT} \
    ${PROB_OUT} \
    ${MERGE_OUT} \
    1.00 \
    0.000001 \
    0.0001 \
This change causes the whole command to be printed rather than executed, while >&2 redirects echo's standard output to standard error (which may help get the data printed to the terminal if some output capture is going on). Please note that this change is for debugging/curiosity only: it makes your script skip execution of the command.
If you know the values, the whole command is likely to be easier to make sense of.
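A variation on the same debugging idea is to print every argument on its own line, which makes it easy to see where one argument ends and the next begins. The sketch below uses a hypothetical helper called print_args and made-up HDFS paths; in the real script the variables are set elsewhere.

```shell
#!/bin/bash
# Hypothetical values for illustration only.
IDF_OUT=/user/demo/idf
IG_OUT=/user/demo/ig

# print_args is a made-up helper: it prints each argument it receives
# between angle brackets, one per line, instead of running anything.
print_args() {
    printf '<%s>\n' "$@"
}

# Prefix the real command with the helper to see what hadoop would receive.
print_args hadoop jar cc-jar-with-dependencies.jar "${IDF_OUT}" "${IG_OUT}" 1.00
```

As with the echo version, this replaces execution of the command and is meant for inspection only.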

Bash: Trying to append to a variable name in the output of a function

this is my very first post on Stackoverflow, and I should probably point out that I am EXTREMELY new to a lot of programming. I'm currently a postgraduate student doing projects involving a lot of coding in various programs, everything from LaTeX to bash, MATLAB etc etc.
If you could explicitly explain your answers that would be much appreciated as I'm trying to learn as I go. I apologise if there is an answer elsewhere that does what I'm trying to do, but I have spent a couple of days looking now.
So to the problem I'm trying to solve: I'm currently using a selection of bioinformatics tools to analyse a range of genomes, and I'm trying to somewhat automate the process.
I have a few sequences with names that look like this for instance (all contained in folders of their own currently as paired files):
SOL2511_S5_L001_R1_001.fastq
SOL2511_S5_L001_R2_001.fastq
SOL2510_S4_L001_R1_001.fastq
SOL2510_S4_L001_R2_001.fastq
...and so on...
I basically wish to automate the process by turning these into variables and passing these variables to each of the programs I use in turn. So for example my idea thus far was to assign them as wildcards, using the R1 and R2 (which appear in all the file names, as they represent each strand of DNA) as follows:
#!/bin/bash
seq1=*R1_001*
seq2=*R2_001*
On a rudimentary level this works, as it returns the correct files, so now I pass these variables to my first function which trims the DNA sequences down by a specified amount, like so:
# seqtk is the program suite, trimfq is a function within it,
# and the options -b -e specify how many bases to trim from the beginning and end of
# the DNA sequence respectively.
seqtk trimfq -b 10 -e 20 $seq1 >
seqtk trimfq -b 10 -e 20 $seq2 >
So now my problem is I wish to be able to append something like "_trim" to the output file which appears after the >, but I can't find anything that seems like it will work online.
Alternatively, I've been hunting for a script that will take the name of the folder that the files are in, and create a variable for the folder name which I can then give to the functions in question so that all the output files are named correctly for use later on.
Many thanks in advance for any help, and I apologise that this isn't really much of a minimum working example to go on, as I'm only just getting going on all this stuff!
Joe
EDIT
So I modified @ghoti's for loop (does the job wonderfully I might add, rep for you :D) and now I add trim_ instead, because the loop as it was before ended up giving me a .fastq.trim extension, which will cause errors later.
Is there any way I can append _trim to the end of the filename, but before the extension?

Explicit is usually better than implied, when matching filenames. Your wildcards may match more than you expect, especially if you have versions of the files with "_trim" appended to the end!
I would be more precise with the wildcards, and use for loops to process the files instead of relying on seqtk to handle multiple files. That way, you can do your own processing on the filenames.
Here's an example:
#!/bin/bash
# Define an array of sequences
sequences=(R1_001 R2_001)
# Step through the array...
for seq in "${sequences[@]}"; do
    # Step through the files in this sequence...
    for file in SOL*_${seq}.fastq; do
        seqtk trimfq -b 10 -e 20 "$file" > "${file}.trim"
    done
done
I don't know how your folders are set up, so I haven't addressed that in this script. But the basic idea is that if you want the script to be able to manipulate individual filenames, you need something like a for loop to handle that manipulation on a per-filename basis.
Does this help?
UPDATE:
To put _trim before the extension, replace the seqtk line with the following:
seqtk trimfq -b 10 -e 20 "$file" > "${file%.fastq}_trim.fastq"
This uses something documented in the Bash man page under Parameter Expansion if you want to read up on it. Basically, the ${file%.fastq} takes the $file variable and strips off a suffix. Then we add your extra text, along with the suffix.
You could also strip an extension using basename(1), but there's no need to call something external when you can use something built in to the shell.
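To see what these expansions produce without running seqtk, they can be tried on a plain string. The file name is taken from the question; the directory path is a made-up example for the folder-name idea mentioned there.

```shell
#!/bin/bash
file="SOL2511_S5_L001_R1_001.fastq"

# % removes the shortest matching suffix, so _trim can go before the extension.
trimmed="${file%.fastq}_trim.fastq"

# %% removes the longest matching suffix: everything from the first underscore on.
sample="${file%%_*}"

# ## removes the longest matching prefix: everything up to the last slash.
dir="/data/genomes/SOL2511"    # hypothetical folder
folder="${dir##*/}"

echo "$trimmed"    # SOL2511_S5_L001_R1_001_trim.fastq
echo "$sample"     # SOL2511
echo "$folder"     # SOL2511
```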

Instead of setting variables with the filenames, you could pipe the output of ls to the command you want to run with these filenames, like this:
ls *R{1,2}_001* | xargs -I@ sh -c 'seqtk trimfq -b 10 -e 20 "$1" > "${1}_trim"' -- @
xargs -I@ takes each line of output from the previous command and substitutes it for @, which sh -c then receives as $1 for seqtk to use.

script to find similar email users

We have a mail server and I am trying to write a script that will find all users with similar names, to prevent malicious users from impersonating legitimate users. For example, a legit user may have the name james2014@domain.com but a malicious user may register as james20l4@domain.com. The difference, if you look carefully, is that I replaced the number 'one' with the letter 'l' (el). So I am trying to write something that can consult my /var/vmail/domain/* and find similar names and alert me (the administrator). I will then take the necessary steps to do what I need. Really appreciate any help.

One hacky way to do this is to derive "normalized" versions of your usernames, put those in an associative array as keys mapping to the original input, and use those to find problems.
The example I posted below uses bash associative arrays to store the mapping from normalized name to original name, and tr to switch some characters for other characters (and delete other characters entirely).
I'm assuming that your list of users will fit into memory; you'll also need to tweak the mapping of modified and removed characters to hit your favorite balance between effectiveness and false positives. If your list can't fit in memory, you can use a single file or the filesystem to approximate it, but honestly if you're processing that many names you're probably better off with a non-shell programming language.
Input:
doc
dopey
james2014
happy
bashful
grumpy
james20l4
sleepy
james.2014
sneezy
Script:
#!/bin/bash
# stdin: A list of usernames. stdout: Pairs of names that match.
CHARS_TO_REMOVE="._\\- "
CHARS_TO_MAP_FROM="OISZql"
CHARS_TO_MAP_TO="0152g1"
normalize() {
    # stdin: A word. stdout: A modified version of the same word.
    exec tr "$CHARS_TO_MAP_FROM" "$CHARS_TO_MAP_TO" \
        | tr --delete "$CHARS_TO_REMOVE" \
        | tr "A-Z" "a-z"
}
declare -A NORMALIZED_NAMES
while read -r NAME; do
    NORMALIZED_NAME=$(normalize <<< "$NAME")
    # -n tests for non-empty strings, as it would be if the name were set already.
    if [[ -n ${NORMALIZED_NAMES[$NORMALIZED_NAME]} ]]; then
        # This name has been seen before! Print both of them.
        echo "${NORMALIZED_NAMES[$NORMALIZED_NAME]} $NAME"
    else
        # This name has not been seen before. Store it.
        NORMALIZED_NAMES["$NORMALIZED_NAME"]="$NAME"
    fi
done
Output:
james2014 james20l4
james2014 james.2014

Stream processing lots of stuff to OVA

So one of our developers needs me to batch a bunch of information and process it into an OVA to be presented back for download. This is an easy process using the long method (i.e. writing to the filesystem), but the developers want a cleaner, streamlined solution that will scale better. They have therefore requested that I stream the entire process, which is proving difficult. Can someone please give me some direction? Here are the steps that need to be accomplished:
1. Get input from webserver (Webserver will pass these as stream eventually.)
   - Random password
   - XML file
2. Modify boot script on file system (i.e. insert random password generated by server)
3. Create ISO of XML file and boot script
4. Calculate the SHA1 sum of ISO
5. Append SHA1 sum of ISO to manifest file in OVF directory
6. Create OVA from OVF directory
Here is an example directory structure (I outlined this in / just for simplicity)
/--
 |
 |--ISO/
 |   |
 |   |--boot.sh (Where the random password gets inserted)
 |   |--config.xml (This is handed from the web server. Needs to stream from server)
 |
 |--OVF/
     |
     |--disk.vmdk
     |--ovf.xml
     |--manifest.mf (Contains SHA1 of all files in OVF directory)
     |--boot.iso (This file will exist once created from ISO directory)
Here is what I have so far (I'll explain the issues afterwards. Yes... there are a lot of issues):
cat /ISO/boot.sh | sed "s%DEFAULT%RANDOM%" | mkisofs /ISO/* | echo "SHA1(boot.iso)= " && sha1sum >> manifest.mf | tar -cvf success.ova /OVF/*
NOTE
In boot.sh there is a variable set to DEFAULT like this (Just for testing purposes):
PASSWORD="DEFAULT"
NOTE
This is what a line in the manifest file should look like:
SHA1(boot.iso)= 5fbc0d70 BLAH BLAH BLAH a91c9121bb
So I've never tried to write an entire script in one stream before. Usually I write to the filesystem a lot as I go. The first issue I see with this is that sed is replacing the string, but what it pipes over to mkisofs will not be used, as mkisofs is just going to make an ISO of what it finds in /ISO. I don't even know if you can pass something like that to mkisofs. Piping is sometimes weird to think about.
Next, I think mkisofs is OK because I didn't specify an output file, so it should write to stdout, which will be passed to sha1sum. But here is the next problem I see: I need to append some additional text to the file before the SHA1 sum gets added, which kind of interrupts the stream.
Finally, the last problem I see is how to pass everything to tar to build the OVA without writing to the filesystem first (writing to manifest.mf).
Oh, and the last BIG problem, which I should have mentioned first, is the config.xml file. Right now I'm dealing with it as just a file. The dev guys want to pass it to this script as a stream as well. I don't have a clue how to handle that.
Any help would be greatly appreciated. These concepts are a little beyond my knowledge.
Thanks!
UPDATE 12/11/13 2:11PM EST
Testing each part individually right now. Will report findings below soon.
UPDATE 12/11/13 2:14PM EST
The following works:
cat /ISO/boot.sh | sed "s%DEFAULT%RANDOM%"
and produces the following output:
PASSWORD="RANDOM"
Exactly as expected.
You are correct NeronLeVelu, I will have to come back later and look at sed more carefully when real random passwords are being generated. ie. Making sure proper characters are escaped. Right now though, I'm just testing the logic. I will worry about regex and escaping later. We have not even decided on random password yet. It's only temporary and will most likely be alphanumeric.
Moving onto next part. Still not sure how to take the output from sed (stdout) and use it to include in ISO creation without actually creating a file that gets written to the file system. It may not be possible without writing to file system. More to come soon

# Escape the password for sed in case it contains &, \ or the separator used in your sed command (the default is /)
Password4Sed=$(printf '%s\n' "${PASSWORD}" | sed 's|[\\/&]|\\&|g')
# no need for cat with sed
sed "s/DEFAULT/${Password4Sed}/" /ISO/boot.sh > /tmp/mkisofs.input
Process the rest from this input, and add tests to validate each step, such as checking for an empty checksum value or an empty mkisofs.input. This will help at runtime when errors occur in production.
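To check that the escaping step behaves as intended, it can be exercised on a throwaway copy of boot.sh. In this sketch the temporary directory, the output file name boot.generated.sh and the awkward test password are all assumptions made for the demonstration.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Stand-in for /ISO/boot.sh, using the placeholder line from the question.
printf 'PASSWORD="DEFAULT"\n' > "$workdir/boot.sh"

# Hypothetical password containing all three troublesome characters: / & and \
PASSWORD='p/ss&w\rd'

# Put a backslash in front of every \, / and & so that sed treats them literally
# in the replacement text of the next command.
escaped=$(printf '%s\n' "$PASSWORD" | sed 's|[\\/&]|\\&|g')

# Substitute the escaped password for the DEFAULT placeholder.
sed "s/DEFAULT/${escaped}/" "$workdir/boot.sh" > "$workdir/boot.generated.sh"

# A simple validation step: fail loudly if the result is empty.
[ -s "$workdir/boot.generated.sh" ] || { echo "substitution produced no output" >&2; exit 1; }

cat "$workdir/boot.generated.sh"
```

If the password is restricted to alphanumeric characters, as suggested in the question, the escaping step is harmless but not strictly needed.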
