Nextflow: How to process multiple samples - bioinformatics

I have a few fq.gz files for a few samples. I am trying to process all samples at once using Nextflow, but somehow I am unable to process all the samples at once; I can only process a single sample at a time. Here is the data structure and my code for processing a single sample.
My nextflow code
params.sampleName="sample1"
params.fastq_path = "data/${params.sampleName}/*{1,2}.fq.gz"
fastq_files = Channel.fromFilePairs(params.fastq_path)
params.ref = "ab.fa"
ref = file(params.ref)
process foo {
input:
set pairId, file(reads) from fastq_files
output:
file("${pairId}.bam") into bamFiles_ch
script:
"""
echo ${reads[0].toRealPath().getParent().baseName}
bwa-mem2 mem -t 8 ${ref} ${reads[0].toRealPath()} ${reads[1].toRealPath()} | samtools sort -@8 -o ${pairId}.bam
samtools index -@8 ${pairId}.bam
"""
}
process samToolsMerge {
publishDir "./aligned_minimap/", mode: 'copy', overwrite: 'false'
input:
file bamFile from bamFiles_ch.collect()
output:
file("**")
script:
"""
samtools merge ${params.sampleName}.bam ${bamFile}
samtools index -@ 8 ${params.sampleName}.bam
"""
}
So I need some help to solve this. Thanks in advance.

It looks like you've already built in a way to set your target sample name using:
params.sampleName="sample1"
params.fastq_path = "data/${params.sampleName}/*{1,2}.fq.gz"
To have the glob pattern match all samples, you could simply set the wildcard on the command line using:
nextflow run main.nf --sampleName '*'
Note the quotation marks above. If these are omitted, the glob star will be expanded by your shell before it is passed to your Nextflow command.
The short answer is that you need some easy way to extract the sample name from the parent directory, and then some way to group the coordinate-sorted BAMs by that sample name. Below, I've used the new Nextflow DSL 2, but it's not strictly necessary; I just find DSL 2 code a lot easier to read and debug. The example below will need adapting to your exact use case, but it should do very similar things. It uses a special groupKey so that we can dynamically specify the expected number of elements in each tuple prior to calling the groupTuple operator. This lets us stream the collected values as soon as possible, so that each sample can 'merge' as soon as all of its readgroups have been aligned. Without this, all input readgroups would need to finish alignment before the merge could begin.
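Before the full example, here is the groupKey idea in isolation; this is just a minimal sketch with made-up sample and readgroup values, which you could run as its own script to see the sized keys being emitted:
nextflow.enable.dsl=2
workflow {
    Channel.of( ['sampleA', 'rg1'], ['sampleA', 'rg2'], ['sampleB', 'rg1'] ) \
        | groupTuple() \
        | map { sample, readgroups ->
            // the key carries the expected group size, so a later groupTuple
            // can emit each sample as soon as all of its readgroups arrive
            tuple( groupKey(sample, readgroups.size()), readgroups )
        } \
        | transpose() \
        | view()
}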
Contents of nextflow.config:
process {
shell = [ '/bin/bash', '-euo', 'pipefail' ]
}
Contents of main.nf:
nextflow.enable.dsl=2
params.ref_fasta = "GRCh38.primary_assembly.genome.chr22.fa.gz"
params.fastq_files = "data/*/*.read{1,2}.fastq.gz"
process bwa_index {
conda 'bwa-mem2'
input:
path fasta
output:
path "${fasta}.{0123,amb,ann,bwt.2bit.64,pac}"
"""
bwa-mem2 index "${fasta}"
"""
}
process bwa_mem2 {
tag { [sample, readgroup].join(':') }
conda 'bwa-mem2 samtools'
input:
tuple val(sample), val(readgroup), path(reads)
path bwa_index
output:
tuple val(sample), val(readgroup), path("${readgroup}.bam{,.bai}")
script:
def idxbase = bwa_index.first().baseName
def out_files = [ "${readgroup}.bam", "${readgroup}.bam.bai" ].join('##idx##')
def (r1, r2) = reads
"""
bwa-mem2 mem \\
-R '@RG\\tID:${readgroup}\\tSM:${sample}' \\
-t ${task.cpus} \\
"${idxbase}" \\
"${r1}" \\
"${r2}" |
samtools sort \\
--write-index \\
-@ ${task.cpus} \\
-o "${out_files}"
"""
}
process samtools_merge {
tag { sample }
conda 'samtools'
input:
tuple val(sample), path(indexed_bam_files)
output:
tuple val(sample), path("${sample}.bam{,.bai}")
script:
def out_files = [ "${sample}.bam", "${sample}.bam.bai" ].join('##idx##')
def input_bam_files = indexed_bam_files
.findAll { it.name.endsWith('.bam') }
.collect { /"${it}"/ }
.join(' \\\n'+' '*8)
"""
samtools merge \\
--write-index \\
-o "${out_files}" \\
${input_bam_files}
"""
}
workflow {
ref_fasta = file( params.ref_fasta )
bwa_index( ref_fasta )
Channel.fromFilePairs( params.fastq_files ) \
| map { readgroup, reads ->
def (sample_name) = reads*.parent.baseName as Set
tuple( sample_name, readgroup, reads )
} \
| groupTuple() \
| map { sample, readgroups, reads ->
tuple( groupKey(sample, readgroups.size()), readgroups, reads )
} \
| transpose() \
| set { sample_readgroups }
bwa_mem2( sample_readgroups, bwa_index.out )
sample_readgroups \
| join( bwa_mem2.out, by: [0,1] ) \
| map { sample_key, readgroup, reads, indexed_bam ->
tuple( sample_key, indexed_bam )
} \
| groupTuple() \
| map { sample_key, indexed_bam_files ->
tuple( sample_key.toString(), indexed_bam_files.flatten() )
} \
| samtools_merge
}
Run Like:
nextflow run -ansi-log false main.nf
Results:
N E X T F L O W ~ version 21.04.3
Launching `main.nf` [zen_gautier] - revision: dcde9efc8a
Creating Conda env: bwa-mem2 [cache /home/steve/working/stackoverflow/69702077/work/conda/env-8cc153b2eb20a5374bf435019a61c21a]
[63/73c96b] Submitted process > bwa_index
Creating Conda env: bwa-mem2 samtools [cache /home/steve/working/stackoverflow/69702077/work/conda/env-5c358e413a5318c53a45382790eecbd4]
[52/6a92d3] Submitted process > bwa_mem2 (HBR:HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22)
[8b/535b21] Submitted process > bwa_mem2 (UHR:UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22)
[dc/03d949] Submitted process > bwa_mem2 (UHR:UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22)
[e4/bfd08b] Submitted process > bwa_mem2 (HBR:HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22)
[d5/e2aa27] Submitted process > bwa_mem2 (UHR:UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22)
[c2/23ce8a] Submitted process > bwa_mem2 (HBR:HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22)
Creating Conda env: samtools [cache /home/steve/working/stackoverflow/69702077/work/conda/env-912cee20caec78e112a5718bb0c00e1c]
[28/006c03] Submitted process > samtools_merge (HBR)
[3b/51311c] Submitted process > samtools_merge (UHR)

Related

Snakemake Ambiguity between 2 rules with similar output file type

I am trying to map 3 samples: SRR14724459, SRR14724473, and a combination of both SRR14724459_SRR14724473.
I have 2 rules that share a similar file type output (.bam), and even though I am naming their wildcards differently, I still get an ambiguity error:
Building DAG of jobs...
AmbiguousRuleException:
Rules map_hybrid and map are ambiguous for the file /gpfs/scratch/hpadre/snakemake_outputs/mapped_dir/SRR14724459_SRR14724473__0.9.bam.
Expected input files:
map_hybrid: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R1_trimmed_SRR14724473_R1_trimmed_0.9.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R2_trimmed_SRR14724473_R2_trimmed_0.9.fastq
map: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473__0.9/SRR14724459_SRR14724473__0.9_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473__0.9/SRR14724459_SRR14724473__0.9_R2_trimmed.fastq
From my Snakefile, here are my variables:
all_samples: ['SRR14724459', 'SRR14724473']
sample1: ['SRR14724459']
sample2: ['SRR14724473']
titration:[0.9]
This is my rule all:
rule all:
input:
expand(MAPPED_DIR + "/{sample}.bam", sample=all_samples),
expand(MAPPED_DIR + "/{sample1}_{sample2}_{titration}.bam", zip, sample1=list_a_titrations, sample2=list_b_titrations, titration=tit_list)
This is my map rule:
rule map:
input:
r1 = TRIMMED_DIR + "/{sample}/{sample}_R1_trimmed.fastq",
r2 = TRIMMED_DIR + "/{sample}/{sample}_R2_trimmed.fastq"
output:
MAPPED_DIR + "/{sample}.bam"
threads: 28
params:
genome = HUMAN_GENOME_DIR
log:
LOG_DIR + "/map/{sample}_map.log"
benchmark:
BENCHMARK_DIR + "/map/{sample}_bwa_benchmark.txt"
wildcard_constraints:
word='[^0-9]*'
shell:
"""
bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
"""
This is my map_hybrid:
rule map_hybrid:
input:
r1 = TRIMMED_DIR + "/{sample1}_{sample2}_{titration}/{sample1}_R1_trimmed_{sample2}_R1_trimmed_{titration}.fastq",
r2 = TRIMMED_DIR + "/{sample1}_{sample2}_{titration}/{sample1}_R2_trimmed_{sample2}_R2_trimmed_{titration}.fastq"
output:
MAPPED_DIR + "/{sample1}_{sample2}_{titration}.bam"
threads: 28
params:
genome = HUMAN_GENOME_DIR
log:
LOG_DIR + "/map/{sample1}_{sample2}_{titration}_map.log"
benchmark:
BENCHMARK_DIR + "/map/{sample1}_{sample2}_{titration}_bwa_benchmark.txt"
shell:
"""
set +e
bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
exitcode=$?
if [ $exitcode -eq 1 ]
then
exit 1
else
exit 0
fi
"""
The expected input files SHOULD BE as so:
Expected input files:
map_hybrid: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R1_trimmed_SRR14724473_R1_trimmed_0.9.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R2_trimmed_SRR14724473_R2_trimmed_0.9.fastq
map: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_R2_trimmed.fastq
and also
/home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724473_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724473_R2_trimmed.fastq
Your rules map and map_hybrid can both produce your desired files; for Snakemake, they are ambiguous rules.
The names of the wildcards are irrelevant; what is relevant is whether the wildcards in both rules can match the same output filepath.
That is the case here.
While rule map_hybrid can produce the output file SRR14724459_R2_trimmed_SRR14724473_R2_trimmed_0.9.fastq where the wildcard matches are
sample1=SRR14724459
sample2=SRR14724473
the rule map can also produce this output with the wildcard match
sample=SRR14724459_R2_trimmed_SRR14724473
To prevent ambiguity you can use the wildcard_constraint, so that the {sample} wildcard only matches strings starting with SRR followed by numbers:
sample='SRR\d+'
Integrated into your rule map:
rule map:
input:
r1 = TRIMMED_DIR + "/{sample}/{sample}_R1_trimmed.fastq",
r2 = TRIMMED_DIR + "/{sample}/{sample}_R2_trimmed.fastq"
output:
MAPPED_DIR + "/{sample}.bam"
threads: 28
params:
genome = HUMAN_GENOME_DIR
log:
LOG_DIR + "/map/{sample}_map.log"
benchmark:
BENCHMARK_DIR + "/map/{sample}_bwa_benchmark.txt"
wildcard_constraints:
word='[^0-9]*',
sample='SRR\d+'
shell:
"""
bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
"""
This should resolve the ambiguity.

Syntax conflict for "{" using Nextflow

I'm new to Nextflow. I attempted to run a loop in a Nextflow process to remove the extension from sequence file names, but I am running into a syntax error.
params.rename = "sequences/*.fastq.gz"
workflow {
rename_ch = Channel.fromPath(params.rename)
RENAME(rename_ch)
RENAME.out.view()
}
process RENAME {
input:
path read
output:
stdout
script:
"""
for file in $baseDir/sequences/*.fastq.gz;
do
mv -- '$file' '${file%%.fastq.gz}'
done
"""
}
Error:
- cause: Unexpected input: '{' @ line 25, column 16.
process RENAME {
^
I tried other methods such as basename, but to no avail.
Inside a script block, you just need to escape the Bash dollar-variables and use double quotes so that they can expand. For example:
params.rename = "sequences/*.fastq.gz"
workflow {
RENAME()
}
process RENAME {
debug true
"""
for fastq in ${baseDir}/sequences/*.fastq.gz;
do
echo mv -- "\$fastq" "\${fastq%%.fastq.gz}"
done
"""
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [crazy_brown] DSL2 - revision: 71ada7b0d5
executor > local (1)
[71/4321e6] process > RENAME [100%] 1 of 1 ✔
mv -- /path/to/sequences/A.fastq.gz /path/to/sequences/A
mv -- /path/to/sequences/B.fastq.gz /path/to/sequences/B
mv -- /path/to/sequences/C.fastq.gz /path/to/sequences/C
Also, if you find escaping the Bash variables tedious, you may want to consider using a shell block instead.
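For example, the same process written with a shell block might look like this (a sketch: the body uses single quotes, Nextflow variables are referenced with !{} placeholders, and Bash variables need no escaping):
process RENAME {
    debug true

    shell:
    '''
    for fastq in !{baseDir}/sequences/*.fastq.gz;
    do
        echo mv -- "$fastq" "${fastq%%.fastq.gz}"
    done
    '''
}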

Nextflow: Missing output file(s) expected by process

I'm currently making a start on using Nextflow to develop a bioinformatics pipeline. Below, I've created a params.files variable which contains my FASTQ files, and then input this into the fasta_files channel.
The process trimming and its script take this channel as the input, and then ideally I would output all of the "$sample".trimmed.fq.gz files into the output channel, trimmed_channel. However, when I run this script, I get the following error:
Missing output file(s) `trimmed_files` expected by process `trimming` (1)
The nextflow script I'm trying to run is:
#! /usr/bin/env nextflow
params.files = files("$baseDir/FASTQ/*.fastq.gz")
println "fastq files for trimming:$params.files"
fasta_files = Channel.fromPath(params.files)
println "files in the fasta channel: $fasta_files"
process trimming {
input:
file fasta_file from fasta_files
output:
path trimmed_files into trimmed_channel
// the shell script to be run:
"""
#!/usr/bin/env bash
mkdir trimming_report
cd /home/usr/Nextflow
#Finding and renaming my FASTQ files
for file in FASTQ/*.fastq.gz; do
[ -f "\$file" ] || continue
name=\$(echo "\$file" | awk -F'[/]' '{ print \$2 }') #renaming fastq files.
sample=\$(echo "\$name" | awk -F'[.]' '{ print \$1 }') #renaming fastq files.
echo "Found" "\$name" "from:" "\$sample"
if [ ! -e FASTQ/"\$sample"_trimmed.fq.gz ]; then
trim_galore -j 8 "\$file" -o FASTQ #trim the files
mv "\$file"_trimming_report.txt trimming_report #moves to the directory trimming report
else
echo ""\$sample".trimmed.fq.gz exists skipping trim galore"
fi
done
trimmed_files="FASTQ/*_trimmed.fq.gz"
echo \$trimmed_files
"""
}
The script in the process works fine. However, I'm wondering if I'm misunderstanding or missing something obvious. If I've forgotten to include something, please let me know; any help is appreciated!
Nextflow does not export the variable trimmed_files to its own scope unless you tell it to do so using the env output qualifier, however doing it that way would not be very idiomatic.
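For reference, capturing it with the env qualifier would look something like the following (a sketch; note that the captured value would just be the literal glob string assigned in your script, not the files themselves, which is one reason it isn't idiomatic here):
output:
env trimmed_files into trimmed_channel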
Since you know the pattern of your output files ("FASTQ/*_trimmed.fq.gz"), simply pass that pattern as output:
path "FASTQ/*_trimmed.fq.gz" into trimmed_channel
Some things you do, but probably want to avoid:
Changing directory inside your NF process: don't do this, it entirely breaks the whole concept of Nextflow's /work folder setup.
Writing a bash loop inside a NF process: if you set up your channels correctly, there should only be 1 task per spawned process.
Pallie has already provided some sound advice and, of course, the right answer, which is: environment variables must be declared using the env qualifier.
However, given your script definition, I think there might be some misunderstanding about how best to skip the execution of previously generated results. The cache directive is enabled by default and when the pipeline is launched with the -resume option, additional attempts to execute a process using the same set of inputs, will cause the process execution to be skipped and will produce the stored data as the actual results.
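For example, after the first successful run, re-launching the pipeline with:
nextflow run script.nf -resume
will skip any task whose inputs are unchanged and reuse the cached results instead of running trim_galore again.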
This example uses the Nextflow DSL 2 for my convenience, but is not strictly required:
nextflow.enable.dsl=2
params.fastq_files = "${baseDir}/FASTQ/*.fastq.gz"
params.publish_dir = "./results"
process trim_galore {
tag { "${sample}:${fastq_file}" }
publishDir "${params.publish_dir}/TrimGalore", saveAs: { fn ->
fn.endsWith('.txt') ? "trimming_reports/${fn}" : fn
}
cpus 8
input:
tuple val(sample), path(fastq_file)
output:
tuple val(sample), path('*_trimmed.fq.gz'), emit: trimmed_fastq_files
path "${fastq_file}_trimming_report.txt", emit: trimming_report
"""
trim_galore \\
-j ${task.cpus} \\
"${fastq_file}"
"""
}
workflow {
Channel.fromPath( params.fastq_files )
| map { tuple( it.getSimpleName(), it ) }
| set { sample_fastq_files }
results = trim_galore( sample_fastq_files )
results.trimmed_fastq_files.view()
}
Run using:
nextflow run script.nf \
-ansi-log false \
--fastq_files '/home/usr/Nextflow/FASTQ/*.fastq.gz'

Jenkins pipeline: how do I execute a shell command and use the result as the value of a def variable?

In a Jenkins pipeline, I need to execute a shell command and use the result as the value of a def variable.
What should I do? Thank you.
def projectFlag = sh("`kubectl get deployment -n ${namespace}| grep ${project} | wc -l`")
//
if ( "${projectFlag}" == 1 ) {
def projectCI = sh("`kubectl get deployment ${project} -n ${namespace} -o jsonpath={..image}`")
echo "$projectCI"
} else if ( "$projectCI" == "${imageTag}" ) {
sh("kubectl delete deploy ${project} -n ${namespaces}")
def redeployFlag = '1'
echo "$redeployFlag"
if ( "$projectCI" != "${imageTag}" ){
sh("kubectl set image deployment/${project} ${appName}=${imageTag} -n ${namespaces}")
}
else {
def redeployFlag = '2'
}
I believe you're asking how to save the result of a shell command to a variable for later use?
The way to do this is to use some optional parameters available on the shell step interface. See https://jenkins.io/doc/pipeline/steps/workflow-durable-task-step/#sh-shell-script for the documentation
def projectFlag = sh(returnStdout: true,
script: "kubectl get deployment -n ${namespace} | grep ${project} | wc -l"
).trim()
Essentially, set returnStdout to true. The .trim() is critical for ensuring you don't pick up a \n newline character, which would ruin your evaluation logic.
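Also note that the captured value is a String, so convert it before any numeric comparison. A sketch using the variable names from your question:
if (projectFlag.toInteger() == 1) {
    def projectCI = sh(returnStdout: true,
        script: "kubectl get deployment ${project} -n ${namespace} -o jsonpath={..image}"
    ).trim()
    echo "${projectCI}"
}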

Awk/sed script to convert a file from camelCase to underscores

I want to convert several files in a project from camelCase to underscore_case.
I would like a one-liner that only needs the filename to work.
You could use sed also.
$ echo 'fooBar' | sed -r 's/([a-z0-9])([A-Z])/\1_\L\2/g'
foo_bar
$ echo 'fooBar' | sed 's/\([a-z0-9]\)\([A-Z]\)/\1_\L\2/g'
foo_bar
The proposed sed answer has some issues:
$ echo 'FooBarFB' | sed -r 's/([a-z0-9])([A-Z])/\1_\L\2/g'
Foo_bar_fB
I suggest the following:
$ echo 'FooBarFB' | sed -r 's/([A-Z])/_\L\1/g' | sed 's/^_//'
foo_bar_f_b
After a few unsuccessful tries, I got this (I wrote it on several lines for readability, but we can remove the newlines to have a one-liner):
awk -i inplace '{
while ( match($0, /(.*)([a-z0-9])([A-Z])(.*)/, cap))
$0 = cap[1] cap[2] "_" tolower(cap[3]) cap[4];
print
}' FILE
For the sake of completeness, we can adapt it to do the contrary (underscore to CamelCase) :
awk -i inplace '{
while ( match($0, /(.*)([a-z0-9])_([a-z])(.*)/, cap))
$0 = cap[1] cap[2] toupper(cap[3]) cap[4];
print
}' FILE
If you're wondering, -i inplace is a flag only available with awk >= 4.1.0, and it modifies the file in place (as with sed -i). If your awk version is older, you have to do something like:
awk '{...}' FILE > FILE.tmp && mv FILE.tmp FILE
Hope this helps someone!
This might be what you want:
$ cat tst.awk
{
head = ""
tail = $0
while ( match(tail,/[[:upper:]]/) ) {
tgt = substr(tail,RSTART,1)
if ( substr(tail,RSTART-1,1) ~ /[[:lower:]]/ ) {
tgt = "_" tolower(tgt)
}
head = head substr(tail,1,RSTART-1) tgt
tail = substr(tail,RSTART+1)
}
print head tail
}
$ cat file
nowIs theWinterOfOur disContent
From ThePlay About RichardIII
$ awk -f tst.awk file
now_is the_winter_of_our dis_content
From The_play About Richard_iII
but without your sample input and expected output it's just a guess.
Here is a Python script that converts a file with CamelCase functions to snake_case, then fixes up callers as well. Optionally it creates a commit with the changes.
Usage:
style.py -s -c tools/patman/terminal.py
#!/usr/bin/env python3
# SPDX-License-Identifier: GPL-2.0+
#
# Copyright 2021 Google LLC
# Written by Simon Glass <sjg@chromium.org>
#
"""Changes the functions and class methods in a file to use snake case, updating
other tools which use them"""
from argparse import ArgumentParser
import glob
import os
import re
import subprocess
import camel_case
# Exclude functions with these names
EXCLUDE_NAMES = set(['setUp', 'tearDown', 'setUpClass', 'tearDownClass'])
# Find function definitions in a file
RE_FUNC = re.compile(r' *def (\w+)\(')
# Where to find files that might call the file being converted
FILES_GLOB = 'tools/**/*.py'
def collect_funcs(fname):
"""Collect a list of functions in a file
Args:
fname (str): Filename to read
Returns:
tuple:
str: contents of file
list of str: List of function names
"""
with open(fname, encoding='utf-8') as inf:
data = inf.read()
funcs = RE_FUNC.findall(data)
return data, funcs
def get_module_name(fname):
"""Convert a filename to a module name
Args:
fname (str): Filename to convert, e.g. 'tools/patman/command.py'
Returns:
tuple:
str: Full module name, e.g. 'patman.command'
str: Leaf module name, e.g. 'command'
str: Program name, e.g. 'patman'
"""
parts = os.path.splitext(fname)[0].split('/')[1:]
module_name = '.'.join(parts)
return module_name, parts[-1], parts[0]
def process_caller(data, conv, module_name, leaf):
"""Process a file that might call another module
This converts all the camel-case references in the provided file contents
with the corresponding snake-case references.
Args:
data (str): Contents of file to convert
conv (dict): Identifiers to convert
key: Current name in camel case, e.g. 'DoIt'
value: New name in snake case, e.g. 'do_it'
module_name: Name of module as referenced by the file, e.g.
'patman.command'
leaf: Leaf module name, e.g. 'command'
Returns:
str: New file contents, or None if it was not modified
"""
total = 0
# Update any simple functions calls into the module
for name, new_name in conv.items():
newdata, count = re.subn(fr'{leaf}.{name}\(',
f'{leaf}.{new_name}(', data)
total += count
data = newdata
# Deal with files that import symbols individually
imports = re.findall(fr'from {module_name} import (.*)\n', data)
for item in imports:
#print('item', item)
names = [n.strip() for n in item.split(',')]
new_names = [conv.get(n) or n for n in names]
new_line = f"from {module_name} import {', '.join(new_names)}\n"
data = re.sub(fr'from {module_name} import (.*)\n', new_line, data)
for name in names:
new_name = conv.get(name)
if new_name:
newdata = re.sub(fr'\b{name}\(', f'{new_name}(', data)
data = newdata
# Deal with mocks like:
# unittest.mock.patch.object(module, 'Function', ...
for name, new_name in conv.items():
newdata, count = re.subn(fr"{leaf}, '{name}'",
f"{leaf}, '{new_name}'", data)
total += count
data = newdata
if total or imports:
return data
return None
def process_file(srcfile, do_write, commit):
"""Process a file to rename its camel-case functions
This renames the class methods and functions in a file so that they use
snake case. Then it updates other modules that call those functions.
Args:
srcfile (str): Filename to process
do_write (bool): True to write back to files, False to do a dry run
commit (bool): True to create a commit with the changes
"""
data, funcs = collect_funcs(srcfile)
module_name, leaf, prog = get_module_name(srcfile)
#print('module_name', module_name)
#print(len(funcs))
#print(funcs[0])
conv = {}
for name in funcs:
if name not in EXCLUDE_NAMES:
conv[name] = camel_case.to_snake(name)
# Convert name to new_name in the file
for name, new_name in conv.items():
#print(name, new_name)
# Don't match if it is preceded by a '.', since that indicates that
# it is calling this same function name but in a different module
newdata = re.sub(fr'(?<!\.){name}\(', f'{new_name}(', data)
data = newdata
# But do allow self.xxx
newdata = re.sub(fr'self.{name}\(', f'self.{new_name}(', data)
data = newdata
if do_write:
with open(srcfile, 'w', encoding='utf-8') as out:
out.write(data)
# Now find all files which use these functions and update them
for fname in glob.glob(FILES_GLOB, recursive=True):
with open(fname, encoding='utf-8') as inf:
data = inf.read()
newdata = process_caller(data, conv, module_name, leaf)
if do_write and newdata:
with open(fname, 'w', encoding='utf-8') as out:
out.write(newdata)
if commit:
subprocess.call(['git', 'add', '-u'])
subprocess.call([
'git', 'commit', '-s', '-m',
f'''{prog}: Convert camel case in {os.path.basename(srcfile)}
Convert this file to snake case and update all files which use it.
'''])
def main():
"""Main program"""
epilog = 'Convert camel case function names to snake in a file and callers'
parser = ArgumentParser(epilog=epilog)
parser.add_argument('-c', '--commit', action='store_true',
help='Add a commit with the changes')
parser.add_argument('-n', '--dry_run', action='store_true',
help='Dry run, do not write back to files')
parser.add_argument('-s', '--srcfile', type=str, required=True, help='Filename to convert')
args = parser.parse_args()
process_file(args.srcfile, not args.dry_run, args.commit)
if __name__ == '__main__':
main()
