PIG Streaming: _some_ output files are missing - hadoop

The problem can be reproduced using a simple test.
The "pig" script is as follows:
SET pig.noSplitCombination true;
dataIn = LOAD 'input/Test';
DEFINE macro `TestScript` input('DummyInput.txt') output('A.csv', 'B.csv', 'C.csv', 'D.csv', 'E.csv') ship('TestScript');
dataOut = STREAM dataIn through macro;
STORE dataOut INTO 'output/Test';
The actual script is a complex R program but here is a simple "TestScript" that reproduces the problem and doesn't require R:
# Ignore the input coming from the 'DummyInput.txt' file
# For now just create some output data files
echo "File A" > A.csv
echo "File B" > B.csv
echo "File C" > C.csv
echo "File D" > D.csv
echo "File E" > E.csv
The input 'DummyInput.txt' is just some dummy data for now:
Record1
Record2
Record3
For the test, I've loaded the dummy data into HDFS using the following script, which results in 200 input files.
for i in {0..199}
do
hadoop fs -put DummyInput.txt input/Test/Input$i.txt
done
When I run the pig job, it runs without errors. 200 mappers run as expected. However, I expect to see 200 files in the various HDFS directories. Instead I find that a number of the output files are missing:
1 200 1400 output/Test/B.csv
1 200 1400 output/Test/C.csv
1 189 1295 output/Test/D.csv
1 159 1078 output/Test/E.csv
The root "output/Test" has 200 files, which is correct. Folders "B.csv" and "C.csv" have 200 files as well. However, folders "D.csv" and "E.csv" have missing files.
We have looked at the logs but can't find anything that points to why the local output files are not being copied from the data nodes to HDFS.
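For reference, the counts above look like hadoop fs -count output (DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME); a quick sketch for checking the per-folder file counts, using only the standard hadoop fs CLI:
for d in A B C D E; do
    echo -n "output/Test/$d.csv: "
    hadoop fs -ls "output/Test/$d.csv" | grep -c '^-'
done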

Related

baseDir issue with nextflow

This might be a very basic question, but I have just started with Nextflow and I am struggling with the simplest example.
First, I'll explain what I have done and what the problem is.
Aim: I aim to make a workflow for my bioinformatics analyses like the one here (https://www.nextflow.io/example4.html)
Background: I have installed all the packages that were needed and they all work from the console without any error.
My run: I have used the same script as in the example, only replacing the directory names. Here is how I have arranged the directories:
location of script
~/raman/nflow/script.nf
location of Fastq files
~/raman/nflow/Data/T4_1.fq.gz
~/raman/nflow/Data/T4_2.fq.gz
Location of transcriptomic file
~/raman/nflow/Genome/trans.fa
The script
#!/usr/bin/env nextflow
/*
* The following pipeline parameters specify the reference genomes
* and read pairs and can be provided as command line options
*/
params.reads = "$baseDir/Data/T4_{1,2}.fq.gz"
params.transcriptome = "$baseDir/HumanGenome/SalmonIndex/gencode.v42.transcripts.fa"
params.outdir = "results"
workflow {
    read_pairs_ch = channel.fromFilePairs( params.reads, checkIfExists: true )
    INDEX(params.transcriptome)
    FASTQC(read_pairs_ch)
    QUANT(INDEX.out, read_pairs_ch)
}
process INDEX {
    tag "$transcriptome.simpleName"

    input:
    path transcriptome

    output:
    path 'index'

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}
process FASTQC {
    tag "FASTQC on $sample_id"
    publishDir params.outdir

    input:
    tuple val(sample_id), path(reads)

    output:
    path "fastqc_${sample_id}_logs"

    script:
    """
    fastqc "$sample_id" "$reads"
    """
}
process QUANT {
    tag "$pair_id"
    publishDir params.outdir

    input:
    path index
    tuple val(pair_id), path(reads)

    output:
    path pair_id

    script:
    """
    salmon quant --threads $task.cpus --libType=U -i $index -1 ${reads[0]} -2 ${reads[1]} -o $pair_id
    """
}
Output:
(base) ntr@ser:~/raman/nflow$ nextflow script.nf
N E X T F L O W ~ version 22.10.1
Launching `script.nf` [modest_meninsky] DSL2 - revision: 032a643b56
executor > local (2)
executor > local (2)
[- ] process > INDEX (gencode) -
[28/02cde5] process > FASTQC (FASTQC on T4) [100%] 1 of 1, failed: 1 ✘
[- ] process > QUANT -
Error executing process > 'FASTQC (FASTQC on T4)'
Caused by:
Missing output file(s) `fastqc_T4_logs` expected by process `FASTQC (FASTQC on T4)`
Command executed:
fastqc "T4" "T4_1.fq.gz T4_2.fq.gz"
Command exit status:
0
Command output:
(empty)
Command error:
Skipping 'T4' which didn't exist, or couldn't be read
Skipping 'T4_1.fq.gz T4_2.fq.gz' which didn't exist, or couldn't be read
Work dir:
/home/ruby/raman/nflow/work/28/02cde5184f4accf9a05bc2ded29c50
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
I believe I have an issue with my understanding of baseDir. I am assuming that baseDir is the directory where my script.nf file lives. I am not sure what is going wrong or how I can fix it.
Could anyone please help or guide me?
Thank you
Caused by:
Missing output file(s) `fastqc_T4_logs` expected by process `FASTQC (FASTQC on T4)`
Nextflow complains when it can't find the declared output files. This can occur even if the command completes successfully, i.e. with exit status 0. The problem here is that fastqc simply skips files that don't exist or can't be read (e.g. permissions problems), but it does produce these warnings:
Skipping 'T4' which didn't exist, or couldn't be read
Skipping 'T4_1.fq.gz T4_2.fq.gz' which didn't exist, or couldn't be read
The solution is simply to make sure all of the files exist. Note that the fromFilePairs factory method produces a tuple whose second element is a list of files, so quoting a space-separated pair of filenames is also problematic. All you need is:
script:
"""
fastqc ${reads}
"""

Nextflow: Missing output file(s) expected by process

I'm currently making a start on using Nextflow to develop a bioinformatics pipeline. Below, I've created a params.files variable which contains my FASTQ files, and then fed this into the fasta_files channel.
The process trimming and its script take this channel as input, and ideally I would output all of the "$sample".trimmed.fq.gz files into the output channel, trimmed_channel. However, when I run this script, I get the following error:
Missing output file(s) `trimmed_files` expected by process `trimming` (1)
The nextflow script I'm trying to run is:
#! /usr/bin/env nextflow
params.files = files("$baseDir/FASTQ/*.fastq.gz")
println "fastq files for trimming:$params.files"
fasta_files = Channel.fromPath(params.files)
println "files in the fasta channel: $fasta_files"
process trimming {
    input:
    file fasta_file from fasta_files

    output:
    path trimmed_files into trimmed_channel

    // the shell script to be run:
    """
    #!/usr/bin/env bash
    mkdir trimming_report
    cd /home/usr/Nextflow
    #Finding and renaming my FASTQ files
    for file in FASTQ/*.fastq.gz; do
        [ -f "\$file" ] || continue
        name=\$(echo "\$file" | awk -F'[/]' '{ print \$2 }') #renaming fastq files.
        sample=\$(echo "\$name" | awk -F'[.]' '{ print \$1 }') #renaming fastq files.
        echo "Found" "\$name" "from:" "\$sample"
        if [ ! -e FASTQ/"\$sample"_trimmed.fq.gz ]; then
            trim_galore -j 8 "\$file" -o FASTQ #trim the files
            mv "\$file"_trimming_report.txt trimming_report #moves to the directory trimming report
        else
            echo ""\$sample".trimmed.fq.gz exists skipping trim galore"
        fi
    done
    trimmed_files="FASTQ/*_trimmed.fq.gz"
    echo \$trimmed_files
    """
}
The script in the process works fine. However, I'm wondering if I'm misunderstanding or missing something obvious. If I've forgotten to include something, please let me know; any help is appreciated!
Nextflow does not export the variable trimmed_files to its own scope unless you tell it to do so using the env output qualifier; however, doing it that way would not be very idiomatic.
Since you know the pattern of your output files ("FASTQ/*_trimmed.fq.gz"), simply pass that pattern as output:
path "FASTQ/*_trimmed.fq.gz" into trimmed_channel
Some things you do but probably want to avoid:
Changing directory inside your NF process: don't do this, as it entirely breaks the concept of Nextflow's work directory setup.
Writing a bash loop inside a NF process: if you set up your channels correctly, there should only be one task per spawned process.
Pallie has already provided some sound advice and, of course, the right answer, which is: environment variables must be declared using the env qualifier.
However, given your script definition, I think there might be some misunderstanding about how best to skip the execution of previously generated results. The cache directive is enabled by default, and when the pipeline is launched with the -resume option, additional attempts to execute a process using the same set of inputs will cause the process execution to be skipped and the stored data to be produced as the actual results.
This example uses Nextflow DSL 2 for my convenience, but it is not strictly required:
nextflow.enable.dsl=2

params.fastq_files = "${baseDir}/FASTQ/*.fastq.gz"
params.publish_dir = "./results"

process trim_galore {
    tag { "${sample}:${fastq_file}" }

    publishDir "${params.publish_dir}/TrimGalore", saveAs: { fn ->
        fn.endsWith('.txt') ? "trimming_reports/${fn}" : fn
    }

    cpus 8

    input:
    tuple val(sample), path(fastq_file)

    output:
    tuple val(sample), path('*_trimmed.fq.gz'), emit: trimmed_fastq_files
    path "${fastq_file}_trimming_report.txt", emit: trimming_report

    """
    trim_galore \\
        -j ${task.cpus} \\
        "${fastq_file}"
    """
}
workflow {
    Channel.fromPath( params.fastq_files )
        | map { tuple( it.getSimpleName(), it ) }
        | set { sample_fastq_files }

    results = trim_galore( sample_fastq_files )

    results.trimmed_fastq_files.view()
}
Run using:
nextflow run script.nf \
-ansi-log false \
--fastq_files '/home/usr/Nextflow/FASTQ/*.fastq.gz'
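To take advantage of the cache behaviour described above, later invocations can simply add the -resume option, for example:
nextflow run script.nf \
-resume \
-ansi-log false \
--fastq_files '/home/usr/Nextflow/FASTQ/*.fastq.gz'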

Shell script: Copy file and folder N times

I have two items:
a <transaction>.json file
a <transaction> folder with random content
where <transaction> is "id" plus a sequential number (id1, id2... idn).
I'd like to replicate this structure (.json + folder) n times. I mean:
I'd like to have an id1.json and an id1 folder, an id2.json and an id2 folder... an idn.json and an idn folder.
Is there any way (e.g. a shell script) to do this?
It would be something like:
for (i=0,i<n,i++) {
copy "id" file to "id+i" file
copy "id" folder to "id+i" folder
}
Any ideas?
Your shell syntax is off but after that, this should be trivial.
#!/bin/bash
for((i=0;i<$1;i++)); do
cp "id".json "id$i".json
cp -r "id" "id$i"
done
This expects the value of n as the sole argument to the script (which is visible inside the script in $1).
The C-style for((...)) loop is Bash only, and will not work with sh.
A proper production script would also check that it received the expected parameter in the expected format (a single positive number) but you will probably want to tackle such complications when you learn more.
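For illustration, such a check might look like this near the top of the Bash version (a sketch; the exact message and regex are arbitrary choices):
# reject anything that is not a single positive integer
if [[ ! $1 =~ ^[1-9][0-9]*$ ]]; then
    echo "usage: $0 count (a positive integer)" >&2
    exit 1
fi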
Additionally, here is a version that works with sh:
#!/bin/sh
test -e id.json || { (>&2 echo "id.json not found") ; exit 1 ; }
{
seq 1 "$1" 2> /dev/null ||
(>&2 echo "usage: $0 transaction-count") && exit 1
} |
while read i
do
cp "id".json "id$i".json
cp -r "id" "id$i"
done
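For example, if this sh version were saved as populate.sh (a hypothetical name) next to id.json and the id folder, the following would create id1.json/id1 through id3.json/id3:
sh populate.sh 3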

A bash script to split a data file into many sub-files as per an index file using dd

I have a large data file that contains many smaller files joined together.
There is a separate index file that has the file name and the start and end byte of each file within the data file.
I need help creating a bash script to split the large file into its thousands of sub-files.
Data File : fileafilebfilec etc
Index File:
filename.png<0>3049
folder\filename2.png<3049>6136.
I guess this needs to loop through each line of the index file, then use dd to extract the relevant bytes into a file. A fiddly part might be that the folder separator is Windows-style rather than Linux-style.
Any help much appreciated.
while read p; do
    q=${p#*<}
    startbyte=${q%>*}
    endbyte=${q#*>}
    filename=${p%<*}
    count=$(($endbyte - $startbyte))
    toprint="processing $filename startbyte: $startbyte endbyte: $endbyte count: $count"
    echo $toprint
done <indexfile
Worked it out :-) FYI:
while read p; do
    #sort out variables
    q=${p#*<}
    startbyte=${q%>*}
    endbyte=${q#*>}
    filename=${p%<*}
    count=$(($endbyte - $startbyte))
    #let it know we're working
    toprint="processing $filename startbyte: $startbyte endbyte: $endbyte count: $count"
    echo $toprint
    if [[ $filename == *"/"* ]]; then
        echo "have found /"
        directory=${filename%/*}
        #if no directory exists, create it
        if [ ! -d "$directory" ]; then
            # Control will enter here if $directory doesn't exist.
            echo "directory not found - creating one"
            mkdir ~/etg/$directory
        fi
    fi
    dd skip=$startbyte count=$count if=~/etg/largefile of=~/etg/$filename bs=1
done <indexfile
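One caveat hinted at in the question: the sample index uses a Windows-style backslash (folder\filename2.png), so the *"/"* test above would never match such lines. A hedged tweak is to normalize the separator first; and, if GNU dd is available, its skip_bytes/count_bytes flags avoid the slow bs=1 copy:
# convert Windows-style backslashes to forward slashes before looking for a directory
filename=${filename//\\//}
# with GNU dd, skip= and count= can be byte offsets even with a larger block size
dd if=~/etg/largefile of=~/etg/"$filename" iflag=skip_bytes,count_bytes skip=$startbyte count=$count bs=64K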

Batch processing: file name comparison error

I have written a program (Cifti_subject_fmri) which checks whether file names match in two folders and, if so, executes a set of instructions.
#!/bin/bash -- fix_mni_paths
source activate ciftify_v1.0.0
export SUBJECTS_DIR=/scratch/m/mchakrav/dev/functional_data
export HCP_DATA=/scratch/m/mchakrav/dev/tCDS_ciftify
## make the $SUBJECTS_DIR if it does not already exist
mkdir -p ${HCP_DATA}
SUBJECTS=`cd $SUBJECTS_DIR; ls -1d *` ## list of my subjects
HCP=`cd $HCP_DATA; ls -1d *` ## List of HCP Subjects
cd $HCP_DATA
## submit the files to the queue
for i in $SUBJECTS; do
    for j in $HCP; do
        if [[ $i == $j ]]; then
            parallel "echo ciftify_subject_fmri $i/filtered_func_data.nii.gz $j fMRI " ::: $SUBJECTS |qbatch --walltime '05:00:00' --ppj 8 -c 4 -j 4 -N ciftify_subject_fmri -
        fi
    done
done
When I run this code on the cluster I get an error which says:
./Cifti_subject_fmri: [[AS1: command not found
The command ciftify_subject_fmri is part of the ciftify toolbox; to execute, it requires the following arguments:
ciftify_subject_fmri <func.nii.gz> <Subject> <NameOffMRI>
I have 33 subjects [AS1-AS33], each with its own func.nii.gz file located in the SUBJECTS directory; the results need to be populated in the HCP directory, and fMRI is the name of the file format.
Could someone kindly let me know why I am getting an error in the loop?
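For what it's worth, the script as pasted does have spaces inside the [[ ]] test, but an error of exactly this shape is what Bash prints when [[ is not separated from its first operand by whitespace; a minimal illustration (not necessarily what the on-cluster copy contains):
i=AS1; j=AS1
if [[$i == $j ]]; then echo match; fi    # -> script: [[AS1: command not found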
