MANTA (Structural Variant Detection Tool) on CentOS 8 - bioinformatics

Good evening,
I have CentOS 8 as the operating system (OS) installed on my workstation. I would like to install MANTA, but CentOS 8 is not listed among the OSs on which this tool has been tested (https://github.com/Illumina/manta/blob/master/docs/userGuide/installation.md#prerequisites-to-build-from-source).
I wanted to know whether anybody has ever used it on CentOS 8, or whether it is a waste of time trying to make it work.
Thank you all!

Install Through Bioconda
This can be installed via Bioconda. I did a quick test on a CentOS 8 Docker image (centos:centos8) and it seems to work fine (though I didn't test on an actual BAM), so I think you should be good. Below are the steps.
Install Steps
Install Conda. There are lots of options here, but I'd recommend some Miniforge variant. I specifically installed the Miniforge3-Linux-x86_64 version. Note that you should say yes to running conda init, then run source ~/.bashrc when the installer is done.
Install Manta in an environment. Since Manta involves Python scripts and requires Python 2, you'll need a new environment. Here's the command:
conda create -n manta -c conda-forge -c bioconda -c defaults manta
This installs Manta v1.6.0 for me.
Activate environment. The environment needs to be activated before you can use it.
conda activate manta
Run a Manta script. For example, here's the output for running the config command:
configManta.py
Usage: configManta.py [options]
Version: 1.6.0
This script configures the Manta SV analysis pipeline.
You must specify a BAM or CRAM file for at least one sample.
Configuration will produce a workflow run script which
can execute the workflow on a single node or through
sge and resume any interrupted execution.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
--config=FILE provide a configuration file to override defaults in
global config file (/opt/miniforge3/envs/manta/share/m
anta-1.6.0-1/bin/configManta.py.ini)
--allHelp show all extended/hidden options
Workflow options:
--bam=FILE, --normalBam=FILE
Normal sample BAM or CRAM file. May be specified more
than once, multiple inputs will be treated as each BAM
file representing a different sample. [optional] (no
default)
--tumorBam=FILE, --tumourBam=FILE
Tumor sample BAM or CRAM file. Only up to one tumor
bam file accepted. [optional] (no default)
--exome Set options for WES input: turn off depth filters
--rna Set options for RNA-Seq input. Must specify exactly
one bam input file
--unstrandedRNA Set if RNA-Seq input is unstranded: Allows splice-
junctions on either strand
--referenceFasta=FILE
samtools-indexed reference fasta file [required]
--runDir=DIR Name of directory to be created where all workflow
scripts and output will be written. Each analysis
requires a separate directory. (default:
MantaWorkflow)
--callRegions=FILE Optionally provide a bgzip-compressed/tabix-indexed
BED file containing the set of regions to call. No VCF
output will be provided outside of these regions. The
full genome will still be used to estimate statistics
from the input (such as expected fragment size
distribution). Only one BED file may be specified.
(default: call the entire genome)
Extended options (hidden):
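For completeness, here is a minimal sketch of how a run might be configured and launched once the environment is active. The BAM, reference, and run-directory names below are placeholders I made up, not files from this thread:
configManta.py \
  --bam /data/sample1.bam \
  --referenceFasta /refs/genome.fa \
  --runDir MantaWorkflow
MantaWorkflow/runWorkflow.py -j 8
configManta.py writes a runWorkflow.py script into the run directory; the -j option sets how many local cores the workflow may use.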

Related

What is the difference between activating an anaconda environment and running its python executable directly?

I have set up multiple Python environments using Anaconda.
Usually, to run a script "manually", I would open a command line and then type:
activate my-env
python path/to/my/script.py
Fine.
Now I am trying to run a script automatically using a scheduler and I was wondering what the difference was between
Writing a batch file which activates the environment and then executes the script (like in the snippet above)
Calling the python executable from the environment directly (within the envs/my-env/ directory), like below:
/path/to/envs/my-env/python.exe path/to/my/script.py
Both seem to work fine. Is there any difference?
I don't claim to be an expert but here's my 2 cents.
For small scripts, no, there isn't a difference.
You should notice a difference when calling external modules/packages. conda activate alters the system PATH, changing where the command shell searches for executables and supporting files.
If you supply a full path to an interpreter and the full path to an isolated script, then the shell doesn't need to do a lookup, since an explicit path takes priority over PATH. This means you could end up in a situation where the interpreter can see the script but cannot see its dependencies.
If you follow the conda activate process, and the environment is correctly packaged, then the shell will be able to trace any additional resources.
EDIT: The idea behind this is portability. If an admin has been careful in setting up a system, then scripts should have the appropriate visibility - i.e. see everything in its environment plus everything in the main system installation.
It's possible to full-path every call to an interpreter and a script or package location, but then what happens when you need to move it to another machine? You would need to spend a lot of time setting everything up exactly as it was before. On the other hand, if you follow the packaging process, the system path will trace everything for you.
Simply check out the PATH variable in your environment. After conda activation it has been extended by:
\Anaconda3;
\Anaconda3\Library\mingw-w64\bin;
\Anaconda3\Library\usr\bin;
\Anaconda3\Library\bin;
\Anaconda3\Scripts;
\Anaconda3\bin;
This doesn't make much of a difference, if you are just using the standard library in your code. However, if you rely on external packages like pandas, it's a prerequisite so that the modules can be found.
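A quick way to see this for yourself (a sketch; my-env and the paths are placeholders, not from the question) is to print PATH before and after activation and compare it with a direct call to the environment's interpreter:
echo %PATH%
conda activate my-env
echo %PATH%
python -c "import sys; print(sys.executable)"
conda deactivate
C:\path\to\envs\my-env\python.exe -c "import sys; print(sys.executable)"
Both invocations report the same interpreter, but only the activated shell has the environment's Scripts and Library\bin directories at the front of PATH, which matters when your script shells out to other tools installed in that environment.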

transcrypt issue - cannot run the sample program

Following the instructions in the Transcrypt "getting started" docs, I entered the examples 'hello.html' and 'hello.py' in a separate directory.
Entering "transcrypt -b -m hello.py" from the command line results in the error message: "'transcrypt' is not recognized as an internal or external command,
operable program or batch file."
I'm using python3.6, with transcrypt installed in: C:\program files\python36\lib\site-packages\transcrypt
Any help to activate the sample hello.html would be appreciated.
Could you try python -m transcrypt -b -m hello.py
and tell me what the console output is?
Also: are you on Windows, Linux or OS X?
Answer: Windows 10
[EDIT 1]
Looks like Transcrypt was installed under a different Python distro. Would be good to know what's going on, so please keep us informed. I also have several Python installs on my Windows 10 computer and it can be confusing indeed.
[EDIT 2]
Another possibility is manual installation (although it isn't elegant...). From the docs
http://www.transcrypt.org/docs/html/installation_use.html#installation-troubleshooting-checklist
Alternatively, for manual installation under Windows or Linux, follow
the steps below:
Download the Transcrypt zip and unpack it anywhere you like
Add ../Transcrypt-/transcrypt to your system path
To enable minification, additionally the Java Runtime Environment 6 or later has to be installed
Note If you install Transcrypt manually, Transcrypt is started by typing run_transcrypt rather than transcrypt. This allows a pip
installed Transcrypt and a manually installed Transcrypt to be used
side by side selectively
BTW Thanks for the suggestion on Github. We'll look into it and try to improve the docs on this point. It seems to be quite difficult to really draw up a bullet proof installation procedure for each platform.
You might also find it easier to use Python 3's built-in virtual env, so that you install Transcrypt and other Python modules into only one project folder at a time. It's much easier to use than it sounds at first.
Here's how you might do that on Windows 10.
mkdir mynewproject
cd mynewproject
py -3 -m venv myvirtualenv # installs venv files into myvirtualenv
myvirtualenv\Scripts\activate # activates the virtual env
The py -3 command uses the Python launcher for Windows to run the latest installed version of Python 3. The launcher is defined in PEP 397.
Once the virtual environment is activated, the prompt will change to show that. After that, any 'pip install' commands will install packages into 'myvirtualenv' instead of the system-wide location. If you want to deactivate it, just type 'deactivate' or close the shell window. You can also just use 'python' to refer to python3 from within the virtual env. This has saved many people from madness.
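To round out the picture, here is a sketch of how the rest of the workflow might look inside that activated venv (package and file names as in the question, but untested here):
pip install transcrypt
transcrypt -b -m hello.py
Because the venv's Scripts directory is now first on PATH, the transcrypt entry point installed by pip is found without any full paths.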
In case this helps other newbies, here are a few problems I encountered when setting up Transcrypt.
Path issues: I had multiple versions of Python in different folders: \python26, \python27, and \Program Files\python36.
This caused all sorts of grief, despite setting the environment path to include the python36 distro. I fixed this issue by renaming the other versions \python26x and \python27x. This left those distros intact if I ever needed to use them, but stopped the system from finding them; thus, it only found python36.
My earlier suggestion of py -3 didn't really solve the multiple distro issue completely after all.
After doing that, I reinstalled transcrypt and it seemed OK (sort of: read on)
Second issue was trying to run the sample hello.py. I tried "transcrypt -b hello.py" and got a "'transcrypt' is not recognized.." message.
But this worked: python -m transcrypt -b -m hello.py
That worked because the system had finally found the correct version of python, due to the above fix.
Similarly, trying to run the sample hello.py as recommended in the docs caused a problem:
run_transcrypt -b hello.py
The reason for this was that run_transcrypt resolved to "python $(dirname $0)/__main__.py $*".
But, because I had Python 3.6 installed in C:\Program Files, the batch file run_transcrypt produced this output:
c:\transcrypt>python C:\Program Files\Python36\Lib\site-packages\transcrypt\__main__.py -b hello.py
python: can't open file 'C:\Program': [Errno 2] No such file or directory
Consequently, I had to place Program Files in quotes and run it this way:
python "C:\Program Files"\Python36\Lib\site-packages\transcrypt\__main__.py -b hello.py
or else, as above: python -m transcrypt -b -m hello.py
I think, with respect, the docs should raise a warning flag here for users who have python installed in \Program Files, rather than, for example, in c:\python[x]
Third issue: changing hello.py to "play around" with the code.
I found the files in transcrypt\demos\ to be read-only. To fix this:
1: I opened the command prompt as administrator
2: I ran the attrib command to change the file attributes:
"c:\Program Files\Python36\Lib\site-packages\transcrypt\demos\hello>attrib -r -s -a hello.py"
(Without doing this as administrator you get an access-denied message)
The whole exercise caused a few hours of toing and froing, but it seems that things are better now.

Making Sphinx documentation inside of a virtual environment with cron

I have an application development server that is automatically updated every night with a massive shell script that we run with crontab. The script specifies #!/bin/sh at the top of the file and I am not able to change that. The basic purpose of the script is to go through the machine and download the latest code in each of the directories that we list in the script. After all of the repositories are updated, we execute a number of scripts to update the relevant databases using the appropriate virtual environment (Django manage.py commands) by calling that virtualenv's python directly.
The issue that I am having is that we have all the necessary Sphinx plugins installed in one of the virtual environments to allow us to build the documentation from the code at the end of the script, but I cannot seem to figure out how to allow the make command to run inside of the virtualenv so that it has access to the proper packages and libraries. I need a way to run the make command inside of the virtual environment and if necessary deactivate that environment afterwards so that the remainder of the script can run.
My current script looks like the following and gives errors on the last three lines, because sh does not have workon or deactivate, and because make can't find sphinx-build.
cd ${_proj_root}/dev/docs
workon dev
make clean && make html
deactivate
I was able to find the answer to this question elsewhere. The error message that is shown when you attempt to build the Sphinx documentation from the docs root is as follows, and it leads to the answer provided below:
Makefile:12: *** The 'sphinx-build' command was not found. Make sure
you have Sphinx installed, then set the SPHINXBUILD environment
variable to point to the full path of the 'sphinx-build' executable.
Alternatively you can add the directory with the executable to your
PATH. If you don't have Sphinx installed, grab it from
http://sphinx-doc.org/. Stop.
The full command for anyone looking to build Sphinx documentation through a cron job, when all tools are installed in various virtual environments, is listed below. You can find the location of your python and sphinx-build commands by using which while the environment is activated.
make html SPHINXBUILD='<virtualenv-path-to>/python <virtualenv-path-to>/sphinx-build'
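Put back into the original #!/bin/sh script, the failing block might then look something like this (the /path/to/virtualenvs/dev prefix is a placeholder for wherever your 'dev' virtualenv actually lives):
cd ${_proj_root}/dev/docs
make clean
make html SPHINXBUILD='/path/to/virtualenvs/dev/bin/python /path/to/virtualenvs/dev/bin/sphinx-build'
No workon or deactivate is needed, because nothing is ever activated; make simply calls the virtualenv's interpreter and sphinx-build by their full paths.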

Easiest way to install Python dependencies on Spark executor nodes?

I understand that you can send individual files as dependencies with Python Spark programs. But what about full-fledged libraries (e.g. numpy)?
Does Spark have a way to use a provided package manager (e.g. pip) to install library dependencies? Or does this have to be done manually before Spark programs are executed?
If the answer is manual, then what are the "best practice" approaches for synchronizing libraries (installation path, version, etc.) over a large number of distributed nodes?
Having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip when installing dependencies. It blows my mind that this isn't supported better in Spark. The third-party dependency problem is largely solved in general-purpose Python, but under Spark, it seems the assumption is that you'll go back to manual dependency management or something.
I have been using an imperfect but functional pipeline based on virtualenv. The basic idea is
Create a virtualenv purely for your Spark nodes
Each time you run a Spark job, run a fresh pip install of all your own in-house Python libraries. If you have set these up with setuptools, this will install their dependencies.
Zip up the site-packages dir of the virtualenv. This will include your library and its dependencies, which the worker nodes will need, but not the standard Python library, which they already have.
Pass the single .zip file, containing your libraries and their dependencies, as an argument to --py-files.
Of course you would want to code up some helper scripts to manage this process. Here is a helper script adapted from one I have been using, which could doubtless be improved a lot:
#!/usr/bin/env bash
# helper script to fulfil Spark's python packaging requirements.
# Installs everything in a designated virtualenv, then zips up the virtualenv for use as the value
# of the --py-files argument to `pyspark` or `spark-submit`
# First argument should be the top-level virtualenv
# Second argument is the zipfile which will be created, and
# which you can subsequently supply as the --py-files argument to
# spark-submit
# Subsequent arguments are all the private packages you wish to install
# If these are set up with setuptools, their dependencies will be installed
VENV=$1; shift
ZIPFILE=$1; shift
PACKAGES=$*
. $VENV/bin/activate
for pkg in $PACKAGES; do
pip install --upgrade $pkg
done
TMPZIP="$TMPDIR/$RANDOM.zip" # abs path. Use random number to avoid clashes with other processes
( cd "$VENV/lib/python2.7/site-packages" && zip -q -r $TMPZIP . )
mv $TMPZIP $ZIPFILE
I have a collection of other simple wrapper scripts I run to submit my spark jobs. I simply call this script first as part of that process and make sure that the second argument (name of a zip file) is then passed as the --py-files argument when I run spark-submit (as documented in the comments). I always run these scripts, so I never end up accidentally running old code. Compared to the Spark overhead, the packaging overhead is minimal for my small scale project.
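For reference, here is a hedged sketch of what that submission step might look like; the script name, package paths and job file below are placeholders, not names from the original setup:
./package_deps.sh /path/to/venv deps.zip ./mypackage1 ./mypackage2
spark-submit --py-files deps.zip my_job.py
The first command builds deps.zip from the virtualenv's site-packages; the second ships that zip to the executors so the job's imports of your private packages resolve on every node.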
There are loads of improvements that could be made, e.g. being smart about when to create a new zip file, or splitting it into two zip files, one containing often-changing private packages and one containing rarely changing dependencies, which don't need to be rebuilt so often. You could be smarter about checking for file changes before rebuilding the zip. Also, checking the validity of arguments would be a good idea. However, for now this suffices for my purposes.
The solution I have come up with is not designed for large-scale dependencies like NumPy specifically (although it may work for them). Also, it won't work if you are building C-based extensions, and your driver node has a different architecture to your cluster nodes.
I have seen recommendations elsewhere to just run a Python distribution like Anaconda on all your nodes since it already includes NumPy (and many other packages), and that might be the better way to get NumPy as well as other C-based extensions going. Regardless, we can't always expect Anaconda to have the PyPI package we want in the right version, and in addition you might not be able to control your Spark environment to be able to put Anaconda on it, so I think this virtualenv-based approach is still helpful.

OSX autotools: `aclocal.m4' not being output by `autom4te'

I have been trying to build the freetype2 library in OSX Mavericks for several weeks now, but without success.
The trouble is with using GNU Autotools to create the configure build script.
I have installed automake, autoconf, libtoolize, m4 and perl5 using the macports port command.
When executing aclocal, there is supposed to be a file created in the configure directory that contains Autotools macros: aclocal.m4. However, this file is not being output, and the subsequent glibtoolize and autoconf commands are generating a spurious configure script.
The result is: no aclocal.m4 file, and the usual contents of ./autom4te.cache/traces.* being dumped at the top of the generated configure file (the traces.* files are empty).
e.g.:
m4trace:configure.ac:14: -1- AC_SUBST([SHELL])
m4trace:configure.ac:14: -1- AC_SUBST_TRACE([SHELL])
m4trace:configure.ac:14: -1- m4_pattern_allow([^SHELL$])
Any help would be greatly appreciated.
GNU Autotools does not support execution over a working directory stored on a FAT32 file system; it results in spurious m4trace debug messages being dumped into the generated configure script.
It is not clear exactly why, but it may be related to the reliance on file timestamps (and sleep-based checks) to detect whether a file has changed: FAT32 rounds time stamps to the nearest second, while execution and the subsequent modification checks may happen on a sub-second timescale.
This has been raised with the development team, but for now, I move my working directory to my OSX boot partition before executing GNU Autotools.
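As a sketch of that workaround (the source and destination paths here are examples, not the ones from the original post):
# copy the tree off the FAT32 volume onto the HFS+ boot partition
mkdir -p ~/build
cp -R /Volumes/FAT32DRIVE/freetype2 ~/build/freetype2
# cd to wherever the project's configure.ac lives, then regenerate the build system
cd ~/build/freetype2
aclocal && glibtoolize && autoconf
./configure && make
On a MacPorts install, libtoolize is provided as glibtoolize, which matches the commands mentioned above.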
