I have an existing Snakemake RNA-seq workflow that works fine with the directory tree below. I need to alter the workflow so that it can accommodate another layer of directories. Currently I use a Python script that os.walks the parent directory and creates a JSON file for the sample wildcards (the JSON file is also included below). I am not very familiar with Python, but it seems to me that adapting the code for an extra layer of directories shouldn't be too difficult, and I was hoping someone would be kind enough to point me in the right direction.
RNAseqTutorial/
├── Sample_70160
│ ├── 70160_ATTACTCG-TATAGCCT_S1_L001_R1_001.fastq.gz
│ └── 70160_ATTACTCG-TATAGCCT_S1_L001_R2_001.fastq.gz
├── Sample_70161
│ ├── 70161_TCCGGAGA-ATAGAGGC_S2_L001_R1_001.fastq.gz
│ └── 70161_TCCGGAGA-ATAGAGGC_S2_L001_R2_001.fastq.gz
├── Sample_70162
│ ├── 70162_CGCTCATT-ATAGAGGC_S3_L001_R1_001.fastq.gz
│ └── 70162_CGCTCATT-ATAGAGGC_S3_L001_R2_001.fastq.gz
├── Sample_70166
│ ├── 70166_CTGAAGCT-ATAGAGGC_S7_L001_R1_001.fastq.gz
│ └── 70166_CTGAAGCT-ATAGAGGC_S7_L001_R2_001.fastq.gz
├── scripts
├── groups.txt
└── Snakefile
{
    "Sample_70160": {
        "R1": [ "/gpfs/accounts/SlurmMiKTMC/Sample_70160/Sample_70160.R1.fq.gz" ],
        "R2": [ "/gpfs/accounts/SlurmMiKTMC/Sample_70160/Sample_70160.R2.fq.gz" ]
    },
    "Sample_70162": {
        "R1": [ "/gpfs/accounts/SlurmMiKTMC/Sample_70162/Sample_70162.R1.fq.gz" ],
        "R2": [ "/gpfs/accounts/SlurmMiKTMC/Sample_70162/Sample_70162.R2.fq.gz" ]
    }
}
The structure I need to accommodate is below
RNAseqTutorial/
├── part1
│ ├── 030-150-G
│ │ ├── 030-150-GR1_clipped.fastq.gz
│ │ └── 030-150-GR2_clipped.fastq.gz
│ ├── 030-151-G
│ │ ├── 030-151-GR1_clipped.fastq.gz
│ │ └── 030-151-GR2_clipped.fastq.gz
│ ├── 100T
│ │ ├── 100TR1_clipped.fastq.gz
│ │ └── 100TR2_clipped.fastq.gz
├── part2
│ ├── 030-025G
│ │ ├── 030-025GR1_clipped.fastq.gz
│ │ └── 030-025GR2_clipped.fastq.gz
│ ├── 030-131G
│ │ ├── 030-131GR1_clipped.fastq.gz
│ │ └── 030-131GR2_clipped.fastq.gz
│ ├── 030-138G
│ │ ├── 030-138R1_clipped.fastq.gz
│ │ └── 030-138R2_clipped.fastq.gz
├── part3
│ ├── 030-103G
│ │ ├── 030-103GR1_clipped.fastq.gz
│ │ └── 030-103GR2_clipped.fastq.gz
│ ├── 114T
│ │ ├── 114TR1_clipped.fastq.gz
│ │ └── 114TR2_clipped.fastq.gz
├── scripts
├── groups.txt
└── Snakefile
The main script that generates the JSON file for the sample wildcards is below:
import os
import re
from collections import defaultdict
from os.path import join

FILES = defaultdict(lambda: defaultdict(list))

for root, dirs, files in os.walk(args):
    for file in files:
        if file.endswith("fq.gz"):
            full_path = join(root, file)
            # R1 will be forward reads, R2 will be reverse reads
            m = re.search(r"(.+)\.(R[12])\.fq\.gz", file)
            if m:
                sample = m.group(1)
                reads = m.group(2)
                FILES[sample][reads].append(full_path)
I just can't seem to wrap my head around a way to accommodate that extra layer. Is there another module or function other than os.walk? Could I somehow force os.walk to skip a directory and merge the part and sample prefixes? Any suggestions would be helpful!
Edited to add:
I wasn't clear in describing my problem, and I noticed that the second example wasn't representative of it, so I have fixed the examples accordingly (the second tree was taken from a directory processed by someone else). The data I get comes in two forms. In the first, samples are of only one tissue: the directory consists of the working directory, sample folders, and fastq files, where the fastq files have the same prefix as the sample folder they reside in. The second example has samples from two tissues. These tissues must be processed separately from each other, but tissues of both types can be found across the "parts", and tissues of the same type from different "parts" must be processed together. If I could get os.walk to return a four-tuple, or even use

root, dirs, *files = os.walk('Somedirectory')

where the * would append the rest of the directory string to the files variable. Unfortunately, this method does not go down to the file level for the third child directory ('root/part/sample/fastq'). In an ideal world, the same Snakemake pipeline would be able to handle both scenarios with minimal input from the user. I understand that this may not be possible, but I figured I'd ask and see if there is a module that can return all portions of each sample directory string.
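For what it's worth, pathlib can return every portion of a path: Path.parts splits it into a tuple, and .parent.name gives the enclosing directory at any depth. A minimal sketch, using a hypothetical path taken from the second tree above:

```python
from pathlib import Path

# Hypothetical path mirroring the "part/sample/fastq" layout above.
p = Path("RNAseqTutorial/part1/030-150-G/030-150-GR1_clipped.fastq.gz")

print(p.parts)               # every portion of the path as a tuple
print(p.parent.name)         # sample directory: '030-150-G'
print(p.parent.parent.name)  # the extra "part" layer: 'part1'
```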
It seems to me that your problem doesn't have much to do with how to accommodate the second layer. Instead, the question is about the specifications of the directory trees and file names you expect.
In the first case, it seems you can extract the sample name from the first part of the file name. In the second case, the file names are all the same and the sample name comes from the parent directory. So either you implement some logic that detects which naming scheme you are parsing (and this depends on who/what provides the files), or you always extract the sample name from the parent directory, since this should also work for the first case (again, assuming you can rely on such a naming scheme).
If you want to go for the second option, something like this should do:
import os

FILES = {}

for root, dirs, files in os.walk('RNAseqTutorial'):
    for file in files:
        if file.endswith("fastq.gz"):
            sample = os.path.basename(root)
            full_path = os.path.join(root, file)
            if sample not in FILES:
                FILES[sample] = {}
            if 'R1' in file:
                reads = 'R1'
            elif 'R2' in file:
                reads = 'R2'
            else:
                raise Exception('Unexpected file name')
            if reads not in FILES[sample]:
                FILES[sample][reads] = []
            FILES[sample][reads].append(full_path)
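To produce the JSON file that the Snakefile reads, the collected dictionary can then be dumped with the standard json module. A minimal sketch, with a hypothetical FILES dict standing in for the result of the loop above:

```python
import json

# Hypothetical FILES dict, as built by the loop above.
FILES = {
    "030-150-G": {
        "R1": ["RNAseqTutorial/part1/030-150-G/030-150-GR1_clipped.fastq.gz"],
        "R2": ["RNAseqTutorial/part1/030-150-G/030-150-GR2_clipped.fastq.gz"],
    },
}

# Write the wildcard map; sort_keys keeps the file stable across runs.
with open("samples.json", "w") as fh:
    json.dump(FILES, fh, indent=4, sort_keys=True)
```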
Not sure if I understand correctly, but here you go:
import os
from collections import defaultdict
from os.path import join

FILES = defaultdict(lambda: defaultdict(list))

for root, dirs, files in os.walk(args):
    for file in files:
        if file.endswith("fq.gz"):
            full_path = join(root, file)
            reads = 'R1' if 'R1' in file else 'R2'
            sample = root.split('/')[-1]
            FILES[sample][reads].append(full_path)
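Since the sample folder is always the immediate parent of the fastq files, pathlib can express the same idea at any depth, which sidesteps the extra-layer problem entirely. A minimal sketch, assuming the top directory is named RNAseqTutorial:

```python
from collections import defaultdict
from pathlib import Path

FILES = defaultdict(lambda: defaultdict(list))

# rglob("*.fastq.gz") matches files at any depth, so the same loop
# handles both the flat layout and the extra "part" layer.
for fq in Path("RNAseqTutorial").rglob("*.fastq.gz"):
    reads = "R1" if "R1" in fq.name else "R2"
    sample = fq.parent.name  # the sample folder is the immediate parent
    FILES[sample][reads].append(str(fq))
```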
I am a relative beginner developing a Python package. At the root of the repository there are two important directories: images and docs. The former contains some png and svg files I would like to include in the documentation; the latter is where I run sphinx-quickstart. I cannot change that layout, therefore I have to let Sphinx know to use the top-level images directory while building the docs.
According to what I found over the internet I adjusted the conf.py file to have:
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static', '../images']
And in the index.rst I have to point to the image file itself:
.. image:: ../images/scheme.svg
:width: 500
:alt: schematic
:align: center
Having these two set up, I run make html and get clean logs, but the output directory looks a little strange. Once the build is finished I have a docs/_build/html directory which contains _static and _images sub-directories (among many others). What I find strange is that inside docs/_build/html/_static I see all the contents of the root-level images directory copied over, whereas inside docs/_build/html/_images I only have scheme.svg. So essentially this one file is duplicated into these two subdirectories.
This does not look very clean to me. How should I adjust this setup?
Reply to the comment of bad_coder:
Below I will paste a tree with the dir structure (kept only the relevant elements):
.
├── docs
│ ├── Makefile
│ ├── _build
│ │ └── html
│ │ ├── _images
│ │ │ └── scheme.svg
│ │ ├── _static
│ │ │ ├── scheme.svg
│ ├── conf.py
│ ├── index.html
│ ├── index.rst
├── images
│ ├── scheme.svg
I'm having some issues creating unit tests for my Puppet control repository.
I mostly work with roles and profiles with the following directory structure:
[root#puppet]# tree site
site
├── profile
│ ├── files
│ │ └── demo-website
│ │ └── index.html
│ └── manifests
│ ├── base.pp
│ ├── ci_runner.pp
│ ├── docker.pp
│ ├── gitlab.pp
│ ├── logrotate.pp
│ └── website.pp
├── role
│ └── manifests
│ ├── gitlab_server.pp
│ └── nginx_webserver.pp
Where do I need to place my spec files and what are the correct filenames?
I tried placing them here:
[root#puppet]# cat spec/classes/profile_ci_runner_spec.rb
require 'spec_helper'
describe 'profile::ci_runner' do
...
But I get an error:
Could not find class ::profile::ci_runner
The conventional place for a module's spec tests is in the module, with the spec/ directory in the module root. So site/profile/spec/classes/ci_runner_spec.rb, for example.
You could consider installing PDK, which can help you set up the structure and run tests, among other things.
First of all I should say I am a complete newbie to the Yocto world.
I have a working environment that produces my uboot+kernel+rootfs.
I need to add a (complex) driver I have as a subdirectory.
This driver can be compiled natively in the standard way:
here=$(pwd)
make -C /lib/modules/$(uname -r)/build M=$here/bcmdhd modules CONFIG_BCMDHD_PCIE=y CONFIG_BCMDHD=m CONFIG_BCM4359=y
I have seen Integrate out-of-tree driver in kernel and rebuild yocto project image and I have read Yocto Kernel Development Manual.
I tried to follow directions:
Created a directory in .../recipes-kernel beside linux dir.
Copied the source directory in it.
Created a .bb file.
The resulting source tree is:
recipes-kernel/
├── kernel-modules
│ ├── kernel-module-bcmdhd
│ │ └── bcmdhd
│ │ ├── include
│ │ │ ├── include files
│ │ ├── Kconfig
│ │ ├── Makefile
│ │ └── other source files
│ └── kernel-module-bcmdhd_0.1.bb
└── linux
├── linux-imx-4.1.15
│ └── imx
│ └── defconfig
└── linux-imx_4.1.15.bbappend
My BCM89359-mod_0.1.bb contains:
SUMMARY = "Integration of Cypress BCMDHD external Linux kernel module"
LICENSE = "Proprietary"
inherit module
SRC_URI = "file://bcmdhd"
S = "${WORKDIR}"
Unfortunately this doesn't seem to be enough, as running bitbake results in no compilation being attempted.
I am quite plainly missing something, but I'm unable to understand what.
Any help welcome.
You should have the following source tree:
recipes-kernel/
├── kernel-modules
│   ├── kernel-module-bcm89359_0.1.bb
│   └── kernel-module-bcm89359
│       └── bcmdhd
│           ├── Kconfig
│           └── ...
└── linux
    └── ...
(For the record)
You can add MACHINE_ESSENTIAL_EXTRA_RDEPENDS += "kernel-module-bcm89359" to local.conf or to your machine configuration. You can also add KERNEL_MODULE_AUTOLOAD = "bcm89359" to load your module automatically.
So I have:
buildSrc/
├── build.gradle
└── src
├── main
│ ├── groovy
│ │ └── build
│ │ ├── ExamplePlugin.groovy
│ │ └── ExampleTask.groovy
│ └── resources
│ └── META-INF
│ └── gradle-plugins
│ └── build.ExamplePlugin.properties
└── test
└── groovy
└── build
├── ExamplePluginTest.groovy
└── ExampleTaskTest.groovy
Question:
It seems like build.ExamplePlugin.properties maps directly to the build.ExamplePlugin.groovy. Is this the case? Seems terribly inefficient to have only one property in the file. Does it have to be fully qualified, i.e. does the name have to exactly match the full qualification of the class?
Now in the example, I see:
project.pluginManager.apply 'build.ExamplePlugin'
...however if I have that in my test, I get an error to the effect that the simple task the plugin defines, is already defined.
Why bother with test examples that require 'apply' when that is inappropriate for packaging?
I'm new to Xcode and just found out that it stores a bunch of user information and other stuff in the project directory that I don't really need in version control or want to put up on Github.
This is what an Xcode project basically looks like:
AppName/
├── AppName
│   ├── Base.lproj
│   │   ├── LaunchScreen.xib
│   │   └── Main.storyboard
│   ├── Images.xcassets
│   │   └── AppIcon.appiconset
│   │       └── Contents.json
│   ├── AppDelegate.swift
│   ├── Info.plist
│   └── ViewController.swift
├── AppName.xcodeproj
│   ├── project.xcworkspace
│   │   ├── xcuserdata
│   │   │   └── user1.xcuserdatad
│   │   │       └── UserInterfaceState.xcuserstate
│   │   └── contents.xcworkspacedata
│   ├── xcuserdata
│   │   └── user1.xcuserdatad
│   │       └── xcschemes
│   │           ├── AppName.xcscheme
│   │           └── xcschememanagement.plist
│   └── project.pbxproj
└── AppNameTests
    ├── AppNameTests.swift
    └── Info.plist
My inclination is to just commit the AppName/ and AppNameTests/ and exclude the AppName.xcodeproj/ directory. What's the recommended way of doing this?
You'll want to use a .gitignore file to specify which files you don't want to store in GitHub.
Here is how to create the file, and here's what should go in that .gitignore file.
A better question is what should go in that .gitignore file. This is a link to the GitHub repo containing the file you need:
https://github.com/github/gitignore/blob/master/Global/Xcode.gitignore
Make sure you start with this file so the files are properly ignored; if you don't, some files may already have been added and you will have to remove them manually.
The "recommended way" really depends on what you want to do with the project. Typically, there are three choices:
check in only those files which are necessary to build the project
also check in files that reflect development customizations (such as project files that store the names of the currently-visible files in editors)
also check in generated files, to make a complete snapshot of the project state
With the last, you can get into problems with timestamps: while git can be told to know something about commit times (see Checking out old file WITH original create/modified timestamps), few people do it. Without a system that retrieves files using their original timestamps, you end up with a set of files that demand recompilation each time you do a commit.
Even saving the customization files can be problematic, if you move the files to another part of the filesystem (or attempt to share the files with others).
So... use .gitignore to filter out files not needed to build. But check that you can successfully build using a fresh checkout.