Figuring out pipes in python - bash

I am currently writing a program in Python and I am stuck. So my question is:
I have a program that reads a file and prints some lines to stdout, like this:
#imports
import sys

#number of args
numArgs = len(sys.argv)

#ERROR if not enough args were passed
if numArgs <= 1:
    sys.exit("Not enough arguments!")

#naming input file from args
Input = sys.argv[1]

#opening files
try:
    fastQ = open(Input, 'r')
except IOError as e:
    sys.exit(e)

#parsing through file
while 1:
    #saving the lines
    firstL = fastQ.readline()
    secondL = fastQ.readline()
    #you could maybe skip these lines to save ram
    fastQ.readline()
    fastQ.readline()
    #make sure that there are no blank lines in the file
    if firstL == "" or secondL == "":
        break
    #edit the header to begin with '>'
    firstL = '>' + firstL.replace('#', '')
    sys.stdout.write(firstL)
    sys.stdout.write(secondL)

#close the file
fastQ.close()
Now I want to rewrite this program so that I can run a command line like: zcat "textfile" | python "myprogram" > "otherfile". I looked around and found subprocess, but can't seem to figure out how to do it. Thanks for your help.
EDIT:
Now, if what you are doing is trying to write a Python script to orchestrate the execution of both zcat and myprogram, THEN you may need subprocess. – rchang
The intent is to have the "textfile" and the program on a cluster, so I don't need to copy any files from the cluster. I just want to log in to the cluster and use the command zcat "textfile" | python "myprogram" > "otherfile", so that zcat and the program do their thing and I end up with "otherfile" on the cluster. I hope you understand what I want to do.
Edit #2:
my solution
#imports
import sys
import fileinput

# Counter; maybe there is a better way
count = 0

# Iteration over the input
for line in fileinput.input():
    # Selection of the header
    if count == 0:
        # Format the header
        newL = '>' + line.replace('#', '')
        # Print the header
        sys.stdout.write(newL)
    # Selection of the sequence
    elif count == 1:
        # Print the sequence
        sys.stdout.write(line)
    # Increment the counter
    count += 1
    count = count % 4
THX

You could use fastQ = sys.stdin to read the input from stdin instead of the file or (more generally) fastQ = fileinput.input() to read from stdin and/or files specified on the command-line.
There is also fileinput.hook_compressed, so that you don't need zcat and can read the compressed file directly instead:
$ myprogram textfile >otherfile
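A minimal sketch of how that could look, reusing the same four-line record logic as the counter-based solution above (emit_fastq_headers_and_seqs is a hypothetical helper name; openhook and hook_compressed are real fileinput features):

```python
import sys
import fileinput

def emit_fastq_headers_and_seqs(lines, out):
    """For every 4-line FASTQ record, write a '>'-prefixed header and the sequence."""
    for count, line in enumerate(lines):
        pos = count % 4
        if pos == 0:            # header line: strip '#' and prefix '>'
            out.write('>' + line.replace('#', ''))
        elif pos == 1:          # sequence line: pass through
            out.write(line)
        # positions 2 and 3 (the '+' line and the quality line) are skipped

# Usage (uncomment to run as a script); hook_compressed opens .gz/.bz2
# files transparently, and plain files or stdin still work, so both
#     myprogram textfile.gz > otherfile
#     zcat textfile.gz | myprogram > otherfile
# behave the same:
#emit_fastq_headers_and_seqs(
#    fileinput.input(openhook=fileinput.hook_compressed), sys.stdout)
```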

Related

Nextflow: Missing output file(s) expected by process

I'm currently making a start on using Nextflow to develop a bioinformatics pipeline. Below, I've created a params.files variable which contains my FASTQ files, and then fed this into the fasta_files channel.
The process trimming and its script take this channel as input; ideally, I would output all the $sample".trimmed.fq.gz files into the output channel, trimmed_channel. However, when I run this script, I get the following error:
Missing output file(s) `trimmed_files` expected by process `trimming` (1)
The nextflow script I'm trying to run is:
#! /usr/bin/env nextflow

params.files = files("$baseDir/FASTQ/*.fastq.gz")
println "fastq files for trimming:$params.files"
fasta_files = Channel.fromPath(params.files)
println "files in the fasta channel: $fasta_files"

process trimming {
    input:
    file fasta_file from fasta_files

    output:
    path trimmed_files into trimmed_channel

    // the shell script to be run:
    """
    #!/usr/bin/env bash
    mkdir trimming_report
    cd /home/usr/Nextflow

    #Finding and renaming my FASTQ files
    for file in FASTQ/*.fastq.gz; do
        [ -f "\$file" ] || continue
        name=\$(echo "\$file" | awk -F'[/]' '{ print \$2 }') #renaming fastq files
        sample=\$(echo "\$name" | awk -F'[.]' '{ print \$1 }') #renaming fastq files
        echo "Found" "\$name" "from:" "\$sample"
        if [ ! -e FASTQ/"\$sample"_trimmed.fq.gz ]; then
            trim_galore -j 8 "\$file" -o FASTQ #trim the files
            mv "\$file"_trimming_report.txt trimming_report #moves to the trimming_report directory
        else
            echo ""\$sample".trimmed.fq.gz exists, skipping trim_galore"
        fi
    done

    trimmed_files="FASTQ/*_trimmed.fq.gz"
    echo \$trimmed_files
    """
}
The script inside the process works fine on its own. However, I'm wondering if I'm misunderstanding or missing something obvious. If I've forgotten to include something, please let me know; any help is appreciated!
Nextflow does not export the variable trimmed_files to its own scope unless you tell it to do so using the env output qualifier; however, doing it that way would not be very idiomatic.
Since you know the pattern of your output files ("FASTQ/*_trimmed.fq.gz"), simply pass that pattern as output:
path "FASTQ/*_trimmed.fq.gz" into trimmed_channel
Some things you do, but probably want to avoid:
Changing directory inside your NF process: don't do this, it entirely breaks the whole concept of Nextflow's /work folder setup.
Writing a bash loop inside a NF process: if you set up your channels correctly, there should only be one task per spawned process.
Pallie has already provided some sound advice and, of course, the right answer, which is: environment variables must be declared using the env qualifier.
However, given your script definition, I think there might be some misunderstanding about how best to skip the execution of previously generated results. The cache directive is enabled by default; when the pipeline is launched with the -resume option, additional attempts to execute a process with the same set of inputs will cause the process execution to be skipped and the stored data to be produced as the actual results.
This example uses Nextflow DSL 2 for convenience, but DSL 2 is not strictly required:
nextflow.enable.dsl=2

params.fastq_files = "${baseDir}/FASTQ/*.fastq.gz"
params.publish_dir = "./results"

process trim_galore {
    tag { "${sample}:${fastq_file}" }

    publishDir "${params.publish_dir}/TrimGalore", saveAs: { fn ->
        fn.endsWith('.txt') ? "trimming_reports/${fn}" : fn
    }

    cpus 8

    input:
    tuple val(sample), path(fastq_file)

    output:
    tuple val(sample), path('*_trimmed.fq.gz'), emit: trimmed_fastq_files
    path "${fastq_file}_trimming_report.txt", emit: trimming_report

    """
    trim_galore \\
        -j ${task.cpus} \\
        "${fastq_file}"
    """
}

workflow {
    Channel.fromPath( params.fastq_files )
        | map { tuple( it.getSimpleName(), it ) }
        | set { sample_fastq_files }

    results = trim_galore( sample_fastq_files )

    results.trimmed_fastq_files.view()
}
Run using:
nextflow run script.nf \
    -ansi-log false \
    --fastq_files '/home/usr/Nextflow/FASTQ/*.fastq.gz'

To run "pylint" tool on multiple Python files on Windows Machine

I want to run the "pylint" tool on multiple Python files present under one directory.
I want one consolidated report for all the Python files.
I am able to run it on one Python file individually, but I want to run it on a bunch of files.
Please help with the command for this.
I'm not a Windows user, but isn't "pylint directory/*.py" enough?
If the directory is a package (on the PYTHONPATH), you may also run "pylint directory".
Someone wrote a wrapper in Python 2 to handle this.
The code :
#! /usr/bin/env python
'''
Module that runs pylint on all python scripts found in a directory tree.
'''
import os
import re
import sys

total = 0.0
count = 0

def check(module):
    '''
    apply pylint to the file specified if it is a *.py file
    '''
    global total, count
    if module[-3:] == ".py":
        print "CHECKING ", module
        pout = os.popen('pylint %s' % module, 'r')
        for line in pout:
            if re.match("E....:.", line):
                print line
            if "Your code has been rated at" in line:
                print line
                score = re.findall("\d+.\d\d", line)[0]
                total += float(score)
                count += 1

if __name__ == "__main__":
    try:
        print sys.argv
        BASE_DIRECTORY = sys.argv[1]
    except IndexError:
        print "no directory specified, defaulting to current working directory"
        BASE_DIRECTORY = os.getcwd()
    print "looking for *.py scripts in subdirectories of ", BASE_DIRECTORY
    for root, dirs, files in os.walk(BASE_DIRECTORY):
        for name in files:
            filepath = os.path.join(root, name)
            check(filepath)
    print "==" * 50
    print "%d modules found" % count
    print "AVERAGE SCORE = %.02f" % (total / count)
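The wrapper above is Python 2 only (print statements, os.popen). Here is a rough Python 3 port of the same idea, assuming pylint is on the PATH; names like parse_score and check_tree are my own:

```python
#!/usr/bin/env python3
"""Run pylint on every *.py file under a directory tree (Python 3 sketch)."""
import os
import re
import subprocess
import sys

# pylint's summary line looks like: "Your code has been rated at 8.50/10 ..."
RATING_RE = re.compile(r"Your code has been rated at (-?\d+\.\d\d)")

def parse_score(line):
    """Return the pylint rating found in a report line, or None."""
    match = RATING_RE.search(line)
    return float(match.group(1)) if match else None

def check_tree(base_directory):
    """Run pylint on each .py file and return (module_count, average_score)."""
    total, count = 0.0, 0
    for root, _dirs, files in os.walk(base_directory):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(root, name)
            out = subprocess.run(["pylint", path],
                                 capture_output=True, text=True).stdout
            for line in out.splitlines():
                score = parse_score(line)
                if score is not None:
                    total += score
                    count += 1
    return count, (total / count if count else 0.0)

if __name__ == "__main__":
    # unlike the Python 2 version, this sketch requires the directory argument
    if len(sys.argv) > 1:
        modules, average = check_tree(sys.argv[1])
        print("%d modules found" % modules)
        print("AVERAGE SCORE = %.02f" % average)
```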

Awk/sed script to convert a file from camelCase to underscores

I want to convert several files in a project from camelCase to underscore_case.
I would like a one-liner that only needs the filename to work.
You could use sed also.
$ echo 'fooBar' | sed -r 's/([a-z0-9])([A-Z])/\1_\L\2/g'
foo_bar
$ echo 'fooBar' | sed 's/\([a-z0-9]\)\([A-Z]\)/\1_\L\2/g'
foo_bar
The proposed sed answer has some issues:
$ echo 'FooBarFB' | sed -r 's/([a-z0-9])([A-Z])/\1_\L\2/g'
Foo_bar_fB
I suggest the following:
$ echo 'FooBarFB' | sed -r 's/([A-Z])/_\L\1/g' | sed 's/^_//'
foo_bar_f_b
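For comparison, the same substitution can be sketched in Python; note that this convention keeps a run of capitals like 'FB' as a single token ('fb') rather than splitting it into 'f_b' as the two-step sed version does. to_snake is a hypothetical helper name:

```python
import re

def to_snake(name):
    """Insert '_' between a lowercase/digit and an uppercase letter, then lowercase.

    Mirrors sed -r 's/([a-z0-9])([A-Z])/\\1_\\2/g' followed by lowercasing.
    """
    return re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', name).lower()

print(to_snake('fooBar'))    # foo_bar
print(to_snake('FooBarFB'))  # foo_bar_fb
```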
After a few unsuccessful tries, I got this (written on several lines for readability, but the newlines can be removed to get a one-liner):
awk -i inplace '{
    while ( match($0, /(.*)([a-z0-9])([A-Z])(.*)/, cap))
        $0 = cap[1] cap[2] "_" tolower(cap[3]) cap[4];
    print
}' FILE
For the sake of completeness, we can adapt it to do the contrary (underscore to camelCase):
awk -i inplace '{
    while ( match($0, /(.*)([a-z0-9])_([a-z])(.*)/, cap))
        $0 = cap[1] cap[2] toupper(cap[3]) cap[4];
    print
}' FILE
If you're wondering, the -i inplace flag is only available with awk >= 4.1.0 (gawk), and it modifies the file in place (as with sed -i). If your awk version is older, you have to do something like:
awk '{...}' FILE > FILE.tmp && mv FILE.tmp FILE
Hope this helps someone!
This might be what you want:
$ cat tst.awk
{
    head = ""
    tail = $0
    while ( match(tail,/[[:upper:]]/) ) {
        tgt = substr(tail,RSTART,1)
        if ( substr(tail,RSTART-1,1) ~ /[[:lower:]]/ ) {
            tgt = "_" tolower(tgt)
        }
        head = head substr(tail,1,RSTART-1) tgt
        tail = substr(tail,RSTART+1)
    }
    print head tail
}
$ cat file
nowIs theWinterOfOur disContent
From ThePlay About RichardIII
$ awk -f tst.awk file
now_is the_winter_of_our dis_content
From The_play About Richard_iII
but without your sample input and expected output it's just a guess.
Here is a Python script that converts a file with CamelCase functions to snake_case, then fixes up callers as well. Optionally it creates a commit with the changes.
Usage:
style.py -s -c tools/patman/terminal.py
#!/usr/bin/env python3
# SPDX-License-Identifier: GPL-2.0+
#
# Copyright 2021 Google LLC
# Written by Simon Glass <sjg#chromium.org>
#
"""Changes the functions and class methods in a file to use snake case, updating
other tools which use them"""

from argparse import ArgumentParser
import glob
import os
import re
import subprocess

import camel_case

# Exclude functions with these names
EXCLUDE_NAMES = set(['setUp', 'tearDown', 'setUpClass', 'tearDownClass'])

# Find function definitions in a file
RE_FUNC = re.compile(r' *def (\w+)\(')

# Where to find files that might call the file being converted
FILES_GLOB = 'tools/**/*.py'

def collect_funcs(fname):
    """Collect a list of functions in a file

    Args:
        fname (str): Filename to read

    Returns:
        tuple:
            str: contents of file
            list of str: List of function names
    """
    with open(fname, encoding='utf-8') as inf:
        data = inf.read()
    funcs = RE_FUNC.findall(data)
    return data, funcs

def get_module_name(fname):
    """Convert a filename to a module name

    Args:
        fname (str): Filename to convert, e.g. 'tools/patman/command.py'

    Returns:
        tuple:
            str: Full module name, e.g. 'patman.command'
            str: Leaf module name, e.g. 'command'
            str: Program name, e.g. 'patman'
    """
    parts = os.path.splitext(fname)[0].split('/')[1:]
    module_name = '.'.join(parts)
    return module_name, parts[-1], parts[0]

def process_caller(data, conv, module_name, leaf):
    """Process a file that might call another module

    This converts all the camel-case references in the provided file contents
    with the corresponding snake-case references.

    Args:
        data (str): Contents of file to convert
        conv (dict): Identifiers to convert
            key: Current name in camel case, e.g. 'DoIt'
            value: New name in snake case, e.g. 'do_it'
        module_name: Name of module as referenced by the file, e.g.
            'patman.command'
        leaf: Leaf module name, e.g. 'command'

    Returns:
        str: New file contents, or None if it was not modified
    """
    total = 0

    # Update any simple function calls into the module
    for name, new_name in conv.items():
        newdata, count = re.subn(fr'{leaf}.{name}\(',
                                 f'{leaf}.{new_name}(', data)
        total += count
        data = newdata

    # Deal with files that import symbols individually
    imports = re.findall(fr'from {module_name} import (.*)\n', data)
    for item in imports:
        names = [n.strip() for n in item.split(',')]
        new_names = [conv.get(n) or n for n in names]
        new_line = f"from {module_name} import {', '.join(new_names)}\n"
        data = re.sub(fr'from {module_name} import (.*)\n', new_line, data)
        for name in names:
            new_name = conv.get(name)
            if new_name:
                newdata = re.sub(fr'\b{name}\(', f'{new_name}(', data)
                data = newdata

    # Deal with mocks like:
    #    unittest.mock.patch.object(module, 'Function', ...
    for name, new_name in conv.items():
        newdata, count = re.subn(fr"{leaf}, '{name}'",
                                 f"{leaf}, '{new_name}'", data)
        total += count
        data = newdata

    if total or imports:
        return data
    return None

def process_file(srcfile, do_write, commit):
    """Process a file to rename its camel-case functions

    This renames the class methods and functions in a file so that they use
    snake case. Then it updates other modules that call those functions.

    Args:
        srcfile (str): Filename to process
        do_write (bool): True to write back to files, False to do a dry run
        commit (bool): True to create a commit with the changes
    """
    data, funcs = collect_funcs(srcfile)
    module_name, leaf, prog = get_module_name(srcfile)
    conv = {}
    for name in funcs:
        if name not in EXCLUDE_NAMES:
            conv[name] = camel_case.to_snake(name)

    # Convert name to new_name in the file
    for name, new_name in conv.items():
        # Don't match if it is preceded by a '.', since that indicates that
        # it is calling this same function name but in a different module
        newdata = re.sub(fr'(?<!\.){name}\(', f'{new_name}(', data)
        data = newdata

        # But do allow self.xxx
        newdata = re.sub(fr'self.{name}\(', f'self.{new_name}(', data)
        data = newdata
    if do_write:
        with open(srcfile, 'w', encoding='utf-8') as out:
            out.write(data)

    # Now find all files which use these functions and update them
    for fname in glob.glob(FILES_GLOB, recursive=True):
        with open(fname, encoding='utf-8') as inf:
            data = inf.read()
        newdata = process_caller(data, conv, module_name, leaf)
        if do_write and newdata:
            with open(fname, 'w', encoding='utf-8') as out:
                out.write(newdata)

    if commit:
        subprocess.call(['git', 'add', '-u'])
        subprocess.call([
            'git', 'commit', '-s', '-m',
            f'''{prog}: Convert camel case in {os.path.basename(srcfile)}

Convert this file to snake case and update all files which use it.
'''])

def main():
    """Main program"""
    epilog = 'Convert camel case function names to snake in a file and callers'
    parser = ArgumentParser(epilog=epilog)
    parser.add_argument('-c', '--commit', action='store_true',
                        help='Add a commit with the changes')
    parser.add_argument('-n', '--dry_run', action='store_true',
                        help='Dry run, do not write back to files')
    parser.add_argument('-s', '--srcfile', type=str, required=True,
                        help='Filename to convert')
    args = parser.parse_args()
    process_file(args.srcfile, not args.dry_run, args.commit)

if __name__ == '__main__':
    main()

Python: will not read a certain file in a for loop

I have a directory containing files, and they are all processed by my the_script.py script except one, file2.txt.
Independently, I ran a simple for line in file2: print line and it worked just fine; the lines were printed. So the file is not the problem: it is formatted just like the other ones (automatically, as the output of another script).
Here is the_script.py :
#!/usr/bin/python
import os
import glob

#[...] rest of the code, not dealing with the files in question

for filename in glob.glob("outdir/*_mapp"): # need all files in the outdir/ directory with the _mapp suffix
    infilemapp = open(filename)
    print "start"
    print infilemapp # test: printing all filenames
    organism = (filename.split("/", 1)[1])[:-5] # outdir/acorus.txt_mapp --> acorus.txt IRRELEVANT PARSING LINE
    infilelpwe = organism + "_lpwe" # acorus.txt --> acorus.txt_lpwe IRRELEVANT PARSING LINE
    for line in infilemapp:
        print line
    print "end"
What I expected is to get, for ALL files: "start", the filename, the file content, "end". Instead, I get in the console:
bash-4.3$ ./the_script.py
start
<open file 'outdir/file1.txt_mapp', mode 'r' at 0x7fb5795ec930>
['3R', '2F', '0R', '3F', '1R', '4F', '1F']
end
start
<open file 'outdir/file3.txt_mapp', mode 'r' at 0x7fb5795eca50>
['0R', '5R', '7R', '4R', '1F', '6R', '2R', '6F', '1R', '4F', '7F', '5F', '0F', '3R']
end
start
<open file 'outdir/file2.txt_mapp', mode 'r' at 0x7fb5795ec930>
end
As you can see, nothing is printed for file2.txt_mapp.
bash-4.3$ cat outdir/file2.txt_mapp
['5F', '0F', '2F', '6F', '3R', '5R', '6R', '4F', '1R', '4R', '6F']
The file is alphabetically in the middle of all the files. Why does my script not work for this specific one? Any suggestions are appreciated.

Using R with in command bash terminal

I have a set of *.txt files in a specific directory. I have written an R file called SampleStatus.r which contains a single function that reads, processes data and writes the results to an output file.
The function is like:
format_windpro(import_file="in.txt", export_file="out.txt")
I would like to use bash commands to read and process every file in one command using my R file.
Use Rscript. Example code:
for f in ${INPUT_DIR}/*.txt; do
    base=$(basename "$f")
    Rscript SampleStatus.R "$f" "${OUTPUT_DIR}/$base"
done
While in your SampleStatus.R you handle command line arguments like this:
#!/usr/bin/env Rscript
# ...
argv <- commandArgs(T)
# error checking...
import_file <- argv[1]
export_file <- argv[2]
# your function call
format_windpro(import_file, export_file)
