How to download a list of `FastQ` files in `Nextflow` using the `fromSRA` function?

I have a TSV file with various columns. The column of interest for me is the run_accession column, which contains the accession IDs of various genome data samples. I want to write a pipeline in Nextflow that reads the accession IDs from this file using the following command:
cut -f4 datalist.tsv | sed -n 2,11p
Output:
ERR2512385
ERR2512386
ERR2512387
ERR2512388
ERR2512389
ERR2512390
ERR2512391
ERR2512392
ERR2512393
ERR2512394
and feed this list of IDs into the Channel.fromSRA method. So far, I have tried this:
#!/home/someuser/bin nextflow

nextflow.enable.dsl=2

params.datalist = "$baseDir/datalist.tsv"

process fetchRunAccession {

    input:
    path dlist

    output:
    file accessions

    """
    cut -f4 $dlist | sed -n 2,11p
    """
}

process displayResult {

    input:
    file accessions

    output:
    stdout

    """
    echo "$accessions"
    """
}

workflow {
    accessions_p = fetchRunAccession(params.datalist)
    result = displayResult(accessions_p)
    result.view { it }
}
And I get this error:
Error executing process > 'fetchRunAccession'
Caused by:
  Missing output file(s) `accessions` expected by process `fetchRunAccession`
If I run just the first process, it works well and prints 10 lines as expected. The second process is just a placeholder for the actual fromSRA implementation, but I have not been able to use the output of the first process as the input of the second. I am very new to Nextflow and my code probably has some silly mistakes. I would appreciate any help in this matter.

The fromSRA function is actually a factory method. It requires either a project/study id, or one or more accession numbers, which must be specified as a list. A channel emitting accession numbers (like in your example code) will not work here. Also, it would be better to avoid spawning a separate job/process just to parse a small TSV file; instead, just let your main Nextflow process do this. There are lots of ways to do this, but for CSV/TSV input I find Nextflow's CsvSplitter class makes it easy:
import nextflow.splitter.CsvSplitter

nextflow.enable.dsl=2

def fetchRunAccessions( tsv ) {

    def splitter = new CsvSplitter().options( header:true, sep:'\t' )
    def reader = new BufferedReader( new FileReader( tsv ) )

    splitter.parseHeader( reader )

    List<String> run_accessions = []
    Map<String,String> row

    while( row = splitter.fetchRecord( reader ) ) {
        run_accessions.add( row['run_accession'] )
    }
    return run_accessions
}

workflow {

    accessions = fetchRunAccessions( params.filereport )

    Channel
        .fromSRA( accessions )
        .view()
}
Note that Nextflow's ENA download URL was updated recently. You'll need the latest version of Nextflow (21.07.0-edge) to get this to run easily:
NXF_VER=21.07.0-edge nextflow run test.nf --filereport filereport.tsv


Trying to create a bash script to run a script, input some usernames, and return a new line

I'm creating a bash script that runs against a wiki page to check for contributions. The script currently requires you to enter each username one by one and then press Enter, and it runs against the entirety of the list.
I want to do something like this
#! /bin/bash
users="user1,user2,user3,user4,user5 etc"
echo $users \n | <script.py>
But I can't get it to return the newline and run automatically; it currently just enters the users but won't "press enter" for me. Lazy, I know, but I'm trying to learn how to script and this seemed like a good way in.
Any help would be greatly appreciated, thanks!
EDIT:
Here is the portion of the Python code that requires an input of aliases/usernames; once the user (me) has entered them and pressed Enter on the keyboard, the script runs:
aliases_input = str(input("Enter comma separated alias(es): "))
aliases = aliases_input.split(",")
summary = []
try:
    for alias in aliases:
        df, total_activity_count = get_contributions(alias)
        summary.append({"alias": alias, "totalContributions": total_activity_count})
        summary_table = pd.DataFrame(data=summary)
        print(
            "Getting total contributions for:",
            alias,
            "\n",
            df,
            "\n",
            "Total Contributions:",
            total_activity_count,
            "\n",
        )
    print("Summary of all aliases", "\n", summary_table)
except UnboundLocalError:
    pass
Let me reply with a new answer because your edit also changed the problem description. The current version requires the user inputs to be comma separated and this format is now explicitly exploited in the (newly posted) Python script. The general layout is similar to before.
(1) Create the bash script Script.sh to print the string with users to the screen.
#!/bin/bash
StrArray="user1, user2, user3"
echo "$StrArray"
The output which is printed to the screen can subsequently serve as the input for the Python script ReadStdin.py.
(2) The content of my ReadStdin.py is given below. Some short remarks:
I just created some random output in get_contributions in order to get df and total_activity_count.
For brevity, I omitted the try-except statements.
Your posted code creates the pandas dataframe inside the loop. This seems inefficient, as it can be done once after all the data has been appended.
Run ./Script.sh | ./ReadStdin.py
#!/usr/local/bin/python3
import pandas as pd
import random
import sys

# Contribution function
def get_contributions(alias):
    df = random.randrange(10)
    total_activity_count = random.randrange(10, 15)
    return df, total_activity_count

aliases_input = sys.stdin.readlines()[0]  # String in first element of list
aliases = aliases_input.strip().split(",")

summary = []
for alias in aliases:
    df, total_activity_count = get_contributions(alias)
    summary.append({"alias": alias, "totalContributions": total_activity_count})
    print(
        "Getting total contributions for:",
        alias,
        "\n",
        df,
        "\n",
        "Total Contributions:",
        total_activity_count,
        "\n",
    )

summary_table = pd.DataFrame(data=summary)
print("Summary of all aliases", "\n", summary_table)
I created the bash script
#!/bin/bash
StrArray=("user1" "user2" "user3")
for val in "${StrArray[@]}"; do
    echo "$val"
done
and the Python code
#!/usr/local/bin/python3
import sys
data = sys.stdin.readlines()
print("--- PYTHON STRINGS ---")
print(data)
If I run ./Script.sh | ./ReadStdin.py, then the content of data is ['user1\n', 'user2\n', 'user3\n'] for me. That is, the newline character is indeed already included (as remarked by @Erwin). I'm not sure whether this answers your question, but this is the best I can do without having any further details on <script.py>.
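For completeness, a small sketch that accepts either shape of stdin - one alias per line or a single comma separated line - might look like this (the read_aliases helper is just an illustrative name, not part of your script):

#!/usr/local/bin/python3
import sys

def read_aliases(stream):
    """Accept either one comma separated line or one alias per line."""
    aliases = []
    for line in stream.readlines():
        # split each line on commas and drop empty pieces
        aliases.extend(a.strip() for a in line.split(",") if a.strip())
    return aliases

if __name__ == "__main__":
    print(read_aliases(sys.stdin))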

How to use entrezpy and Biopython Entrez libraries to access ClinVar data from genomic position of variant

[Disclaimer: I published this question 3 weeks ago on Biostars, with no answers yet. I would really like to get some ideas/discussion to find a solution, so I am also posting it here.
Biostars post link: https://www.biostars.org/p/447413/]
For one of my PhD projects, I would like to access all variants found in the ClinVar db that are in the same genomic position as the variant in each row of the input GSVar file. The language constraint is Python.
Up to now I have used the entrezpy module entrezpy.esearch.esearcher. Please see more on entrezpy at: https://entrezpy.readthedocs.io/en/master/
From the entrezpy docs, I have followed this guide to access UIDs using the genomic position of a variant: https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html. In code:
# first get UIDs for clinvar records of the same position
# credits: https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html
chr = variants["chr"].split("chr")[1]
start, end = str(variants["start"]), str(variants["end"])
es = entrezpy.esearch.esearcher.Esearcher('esearcher', self.entrez_email)
genomic_pos = chr + "[chr]" + " AND " + start + ":" + end  # + "[chrpos37]"
entrez_query = es.inquire(
    {'db': 'clinvar',
     'term': genomic_pos,
     'retmax': 100000,
     'retstart': 0,
     'rettype': 'uilist'})  # 'usehistory': False
entrez_uids = entrez_query.get_result().uids
Then I have used Entrez from BioPython to get the available ClinVar records:
# process each VariationArchive of each UID
handle = Entrez.efetch(db='clinvar', id=current_entrez_uids, rettype='vcv')
clinvar_records = {}
tree = ET.parse(handle)
root = tree.getroot()
This approach is working. However, it has two main drawbacks:
entrezpy fills up my log file, recording every interaction with Entrez, which makes the log file too big to be read by the hospital collaborator, who is a variant curator.
The entrezpy call entrez_query.get_result().uids returns all UIDs retrieved so far across all requests (say, one request per variant in the GSVar file), which makes the retrieval space-inefficient: the entrez_uids list quickly grows as I process all the variants of a GSVar file. The simple solution that I have implemented is to check which UIDs are new in the current request and then keep only those for Entrez.efetch(). However, I still need to keep all the UIDs seen for previous variants in order to know which UIDs are new. I do this in code by:
# first snippet's first lines go here
entrez_uids = entrez_query.get_result().uids
current_entrez_uids = [uid for uid in entrez_uids if uid not in self.all_entrez_uids_gsvar_file]
self.all_entrez_uids_gsvar_file += current_entrez_uids
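(A set-based variant of the same bookkeeping - sketched here purely as an illustration, with the attribute name taken from the snippet above - would keep the membership check cheap as the number of seen UIDs grows:)

# sketch only: self.all_entrez_uids_gsvar_file initialised as set() elsewhere
entrez_uids = entrez_query.get_result().uids
current_entrez_uids = [uid for uid in entrez_uids
                       if uid not in self.all_entrez_uids_gsvar_file]
self.all_entrez_uids_gsvar_file.update(current_entrez_uids)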
Does anyone have suggestions on how to address these two drawbacks?

Using SystemCommandTasklet to split file

I want to run system commands via SystemCommandTasklet. I tried this with the sample code below, but I get an error.
I think this is because of the command parameter, but I could not fix it.
I would be very glad for any help.
Reference examples:
Using SystemCommandTasklet for split the large flat file into small files
Trying to split files using SystemCommandTasklet - Execution of system command did not finish within the timeout
Error detail:
"CreateProcess error=2, The system cannot find the file specified"
Code sample:
@Bean
@StepScope
public SystemCommandTasklet fileSplitterSystemCommandTasklet(@Value("#{jobParameters['file']}") File file) throws Exception {

    final String fileSeparator = System.getProperty("file.separator");
    String outputDirectory = file.getPath().substring(0, file.getPath().lastIndexOf(fileSeparator)) + fileSeparator + "out" + fileSeparator;

    File output = new File(outputDirectory);
    if (!output.exists()) {
        output.mkdir();
    }

    final String command = String.format("split -a 5 -l 10000 %s %s", file.getName(), outputDirectory);

    var fileSplitterTasklet = new SystemCommandTasklet();
    fileSplitterTasklet.setCommand(command);
    fileSplitterTasklet.setTimeout(60000L);
    fileSplitterTasklet.setWorkingDirectory(outputDirectory);
    fileSplitterTasklet.setTaskExecutor(new SimpleAsyncTaskExecutor());
    fileSplitterTasklet.setSystemProcessExitCodeMapper(touchCodeMapper());
    fileSplitterTasklet.afterPropertiesSet();
    fileSplitterTasklet.setInterruptOnCancel(true);
    fileSplitterTasklet.setEnvironmentParams(new String[]{
            "JAVA_HOME=/java",
            "BATCH_HOME=/Users/batch"});
    return fileSplitterTasklet;
}
You need to use file.getAbsolutePath() instead of file.getPath().
Also, you are using file.getName() in the command:
final String command = String.format("split -a 5 -l 10000 %s %s",file.getName(),outputDirectory);
You should pass the absolute path of the file, or make sure to set the working directory correctly, so that the split command is executed in the same directory as the file.

Is it possible to read pdf/audio/video files(unstructured data) using Apache Spark?

Is it possible to read pdf/audio/video files (unstructured data) using Apache Spark?
For example, I have thousands of pdf invoices and I want to read data from them and perform some analytics on that. What steps must I take to process unstructured data?
Yes, it is. Use sparkContext.binaryFiles to load the files in binary format, and then use map to map the values to some other format - for example, parse the binary content with Apache Tika or Apache POI.
Pseudocode:
val rawFile = sparkContext.binaryFiles(...)
val ready = rawFile.map( /* parse here with the other framework */ )
What is important is that the parsing must be done with another framework, like those mentioned above. The map will get an InputStream as an argument.
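A minimal PySpark sketch of the same idea (the input path and the stand-in parser are hypothetical - in practice you would call Tika, POI, or a Python PDF library inside parse_invoice):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("binary-files-sketch").getOrCreate()
sc = spark.sparkContext

def parse_invoice(path_and_bytes):
    # stand-in parser: just report the file name and its size;
    # replace the body with a real Tika/POI/PDF-parser call
    path, raw_bytes = path_and_bytes
    return (path, len(raw_bytes))

# binaryFiles yields (path, whole file content as bytes) pairs
raw = sc.binaryFiles("hdfs:///invoices/*.pdf")   # hypothetical location
parsed = raw.map(parse_invoice)
print(parsed.take(5))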
We had a scenario where we needed to use a custom decryption algorithm on the input files. We didn't want to rewrite that code in Scala or Python. Python-Spark code follows:
from pyspark import SparkContext, SparkConf, HiveContext, AccumulatorParam
import subprocess
import socket

def decryptUncompressAndParseFile(filePathAndContents):
    '''each line of the file becomes an RDD record'''
    global acc_errCount, acc_errLog
    proc = subprocess.Popen(['custom_decrypt_program', '--decrypt'],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    (unzippedData, err) = proc.communicate(input=filePathAndContents[1])
    if len(err) > 0:  # problem reading the file
        acc_errCount.add(1)
        acc_errLog.add('Error: '+str(err)+' in file: '+filePathAndContents[0]+
                       ', on host: '+socket.gethostname()+' return code:'+str(proc.returncode))
        return []  # this is okay with flatMap
    records = list()
    iterLines = iter(unzippedData.splitlines())
    for line in iterLines:
        #sys.stderr.write('Line: '+str(line)+'\n')
        values = [x.strip() for x in line.split('|')]
        ...
        records.append( (... extract data as appropriate from values into this tuple ...) )
    return records

class StringAccumulator(AccumulatorParam):
    ''' custom accumulator to hold strings '''
    def zero(self, initValue=""):
        return initValue
    def addInPlace(self, str1, str2):
        return str1.strip()+'\n'+str2.strip()

def main():
    ...
    global acc_errCount, acc_errLog
    acc_errCount = sc.accumulator(0)
    acc_errLog = sc.accumulator('', StringAccumulator())
    binaryFileTup = sc.binaryFiles(args.inputDir)
    # use flatMap instead of map, to handle corrupt files
    linesRdd = binaryFileTup.flatMap(decryptUncompressAndParseFile, True)
    df = sqlContext.createDataFrame(linesRdd, ourSchema())
    df.registerTempTable("dataTable")
    ...
The custom string accumulator was very useful in identifying corrupt input files.

Strict searching against two different files

I have two questions regarding the following code:
import subprocess

macSource1 = (r"\\Server\path\name\here\dhcp-dump.txt")
macSource2 = (r"\\Server\path\name\here\dhcp-dump-ops.txt")

with open (r"specific-pcs.txt") as file:
    line = []
    for line in file:
        pcName = line.strip().upper()

        with open (macSource1) as source1, open (macSource2) as source2:
            items = []
            for items in source1:
                if pcName in items:
                    items_split = items.rstrip("\n").split('\t')
                    ip = items_split[0]
                    mac = items_split[4]
                    mac2 = ':'.join(s.encode('hex') for s in mac.decode('hex')).lower()  # Puts the :'s between the pairs.
                    print mac2
                    print pcName
                    print ip
Firstly, as you can see, the script searches the contents of "specific-pcs.txt" against the contents of macSource1 to get various details. How do I get it to search against BOTH macSource1 and macSource2 (as the details could be in either file)?
And secondly, I need a stricter matching process, as at the moment a machine called 'itroom02' will not only find its own details, but also provide the details for another machine called '2nd-itroom02'. How would I achieve that?
Many thanks for your assistance in advance!
Chris.
Perhaps you should restructure it a bit more like this:
macSources = [ r"\\Server\path\name\here\dhcp-dump.txt",
               r"\\Server\path\name\here\dhcp-dump-ops.txt" ]

with open (r"specific-pcs.txt") as file:
    for line in file:
        # ....
        for target in macSources:
            with open (target) as source:
                for items in source:
                    # ....
There's no need to do e.g. line = [] immediately before you do for line in ...:.
As far as the "stricter matching" goes, since you don't give examples of the format of your files, I can only guess - but you might want to try something like if items_split[1] == pcName: after you've done the split, instead of the if pcName in items: before you split (assuming the name is in the second column - adjust accordingly if not).
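Putting both suggestions together, a rough sketch might look like the following (the tab separator and the column that holds the machine name are only guesses - adjust them to match your dump files):

macSources = [ r"\\Server\path\name\here\dhcp-dump.txt",
               r"\\Server\path\name\here\dhcp-dump-ops.txt" ]

with open(r"specific-pcs.txt") as pcs:
    for line in pcs:
        pcName = line.strip().upper()
        for target in macSources:
            with open(target) as source:
                for items in source:
                    items_split = items.rstrip("\n").split('\t')
                    if len(items_split) < 5:
                        continue  # skip lines that don't look like dump records
                    # exact comparison instead of substring search, so 'ITROOM02'
                    # no longer matches '2ND-ITROOM02' (the column index is a guess)
                    if items_split[1].upper() == pcName:
                        print(items_split[0], pcName, items_split[4])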
