Fastest way to cat from multiple files to multiple files - bash

Using bash (and any standard tools), what is the quickest way to cat a large number of files?
I have multiple folders containing multiple files of genetic data.
Each folder name is the species name.
Each filename starts with the query name, followed by species name.
Each file contains two lines, first line is the sequence ID, and the second line is genetic data.
I.e. folder name:
speciesName
E.g. folder names:
SolanumCommersonnii
SolanumLycopersicum
I.e. file name:
queryName_speciesName
E.g. file names:
Q12P34_SolanumCommersonnii.fasta
Q56P78_SolanumCommersonnii.fasta
Q12P34_SolanumLycopersicum.fasta
Example file contents:
>sequenceID blah blah blah
acgtagctagctagtcgatgctagcggctatatgcgatctagtca
There are about 45,000 files per species and about 10 species.
I want to iterate through each file and cat it onto a file for that specific query name, so that all the different species' sequence IDs and genetic data are collected in one file per query.
I've managed to write a bash script to do so, but it feels like this process could be faster. Currently it takes a couple of hours to complete using the following:
readarray -t folders < speciesList.txt
mkdir -p joined
for folder in "${folders[@]}"
do
    foldCont=($(ls "$folder"))
    for fileN in "${foldCont[@]}"
    do
        cat "$folder/$fileN" >> joined/"${fileN%_"${folder}"*}"_allCords.fna
    done
    echo "finished with $folder"
done
E.g. speciesList.txt:
SolanumCommersonnii
SolanumLycopersicum
Any help to make this faster would be appreciated. I've thought about using xargs, but I'm not sure whether that would be any quicker.
EDIT:
E.g. file contents (filename shown here also):
Q12P34_SolanumCommersonnii.fasta
> Q12P34_Solanum_commersonii cultivar 41
tcaacgtagctagctagtcgatgctagcgaggctatatgcgatctagtca
Q56P78_SolanumCommersonnii.fasta
> Q56P78_Solanum_commersonii cultivar 49
tcgatgctagcgaggctatatgcgatctagtcagactaaata
Q12P34_SolanumLycopersicum.fasta
> Q12P34_SolanumLycopersicum cultivar 98
tgctagcgaggctatatgcgatctagtcgaagaagaattataaa
E.g. expected output files:
joined/Q12P34_allCords.fna
> Q12P34_Solanum_commersonii cultivar 41
tcaacgtagctagctagtcgatgctagcgaggctatatgcgatctagtca
> Q12P34_SolanumLycopersicum cultivar 98
tgctagcgaggctatatgcgatctagtcgaagaagaattataaa
joined/Q56P78_allCords.fna
> Q56P78_Solanum_commersonii cultivar 49
tcgatgctagcgaggctatatgcgatctagtcagactaaata
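
One plausible speed-up (a sketch under assumptions, not a measured result): the script above forks cat once per file, roughly 450,000 times, and reopens each output file for every append. If query names never contain an underscore and the species folders sit in the current directory, you can instead collect the unique query prefixes once and run a single cat per query (roughly 45,000 invocations), letting the shell glob that query's file out of every species folder:

#!/bin/bash
mkdir -p joined
# list every query prefix once (assumes no "_" inside query names)
for f in */*.fasta; do
    q=${f##*/}                 # drop the folder part
    printf '%s\n' "${q%%_*}"   # keep the part before the first "_"
done | sort -u |
while IFS= read -r q; do
    # one cat per query writes the whole output file in a single pass
    cat */"${q}"_*.fasta > "joined/${q}_allCords.fna"
done

The per-query cat calls are independent, so they could also be spread over cores with xargs -P, but it is worth measuring first: at these file counts the bottleneck may be filesystem metadata rather than the forks.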

Related

for loop concatenating files that share part of their common basename (paired end sequencing reads)

I'm trying to concatenate a bunch of paired files into one file (for those who work with sequencing data, you'll be familiar with the paired-end read format).
For example, I have
SLG1.R1.fastq.gz
SLG1.R2.fastq.gz
SLG2.R1.fastq.gz
SLG2.R2.fastq.gz
SLG3.R1.fastq.gz
SLG3.R2.fastq.gz
etc.
I need to concatenate the two SLG1 files, the two SLG2 files, and the two SLG3 files.
So far I have this:
cd /workidr/slg/diet_manip/filtered_concatenated_reads/nonhost
for i in *1.fastq.gz
do
base=(basename $i "1.fastq.gz")
cat ${base}1.fastq.gz ${base}2.fastq.gz > /workdir/slg/diet_manip/filtered_concatenated_reads/cat/${base}.fastq.gz
done
The original files are all in the /filtered_concatenated_reads/nonhost directory, and I want the concatenated versions to be in /filtered_concatenated_reads/cat
The above code gives this error:
-bash: /workdir/slg/diet_manip/filtered_concatenated_reads/cat/basename.fastq.gz: No such file or directory
Any ideas?
Thank you!!
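
A hedged note on the likely cause: base=(basename $i "1.fastq.gz") is an array assignment, so ${base} expands to the literal first word, basename, which is exactly what appears in the error path. Command substitution was presumably intended; a sketch of the corrected loop, assuming the directory layout in the question (also note the cd path in the question is spelled workidr, and the target directory cat must exist for the redirect to succeed):

cd /workdir/slg/diet_manip/filtered_concatenated_reads/nonhost
mkdir -p ../cat                          # the redirect fails if this is missing
for i in *1.fastq.gz
do
    base=$(basename "$i" "1.fastq.gz")   # command substitution, not an array
    cat "${base}1.fastq.gz" "${base}2.fastq.gz" > "../cat/${base}.fastq.gz"
done

Concatenating gzip files directly with cat is fine here: a stream of gzip members is itself a valid gzip file.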

How to loop over multiple folders to concatenate FastQ files?

I have received multiple fastq.gz files from Illumina Sequencing for 100 samples. But all the fastq.gz files for the respective samples are in separate folders according to the sample ID. Moreover, I have multiple (8-16) R1.fastq.gz and R2.fastq.gz files for one sample. So, I used the following code for concatenating all the R1.fastq.gz and R2.fastq.gz into a single R1.fastq.gz and R2.fastq.gz.
cat V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz > sample_R1.fq.gz
So in the sequencing file names, the structure is like the code above. For each sample, the string starting with V has a different number, then L has a different number, and then there is another string of digits before the _1 and _2. The numbers change from sample to sample.
My question is: how can I create a loop that goes over all the folders at once, takes the different numbering of the sequence files into account, and concatenates the multiple fq.gz files into a single R1 and a single R2 file per sample?
Surely, I cannot just concatenate one by one by going into each sample folder.
Please give some helpful tips. Thank you.
The folder structure is the following:
/data/Sample_1/....._525_1_fq.gz /....._525_2_fq.gz /....._526_1_fq.gz /....._526_2_fq.gz
/data/Sample_2/....._580_1_fq.gz /....._580_2_fq.gz /....._589_1_fq.gz /....._589_2_fq.gz
/data/Sample_3/....._690_1_fq.gz /....._690_2_fq.gz /....._645_1_fq.gz /....._645_2_fq.gz
Below I have attached a screenshot of the folder structure.
[screenshot: folder structure]
Based on the provided file structure, would you please try:
#!/bin/bash
for d in Raw2/C*/; do
    (
        cd "$d" || exit           # skip this folder if cd fails
        id=${d%/}; id=${id##*/}   # extract ID from the directory name
        cat V*_1.fq.gz > "${id}_R1.fq.gz"
        cat V*_2.fq.gz > "${id}_R2.fq.gz"
    )
done
The syntax for d in Raw2/C*/ loops over the subdirectories of Raw2 whose names start with C.
The parentheses make the inner commands run in a subshell, so we don't have to worry about returning from cd "$d" (at the expense of a little extra execution time).
The variable id is assigned the ID extracted from the directory name.
cat V*_1.fq.gz, for example, will be expanded to V350028825_L04_581_1.fq.gz V350028825_L04_582_1.fq.gz V350028825_L04_583_1.fq.gz ... according to the files in the directory, and these are concatenated into ${id}_R1.fq.gz. Same for ${id}_R2.fq.gz.
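For instance, with a hypothetical directory name Raw2/C123/, the two expansions behave like this:

d=Raw2/C123/
id=${d%/}      # strips the trailing slash           -> Raw2/C123
id=${id##*/}   # strips everything up to the last /  -> C123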

How do I append the contents of numerous files to a single file?

I have 44 RTF files (file1.rtf, file2.rtf, ..., file44.rtf) and I need to combine them all into a single file (either file1.rtf or a new file altogether).
I understand that the way to combine the contents of two files is like this:
cat file2.rtf >> file1.rtf
This example appends the contents of file2.rtf into file1.rtf.
I also understand that I need to iterate through the files, which I can achieve like this:
for file in *.rtf;
do
# do something;
done
As such, I have this which appears to do the job:
#!/bin/bash
for file in *.rtf;
do
cat $file >> "../combined.rtf";
echo "File $file added."
done
But there is an issue: when I run cat ../combined.rtf I see the combined documents but when I run open ../combined.rtf it only shows me the contents of file1.rtf (in LibreOffice Writer).
Where have I gone wrong?
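
A hedged note on the likely cause: nothing is wrong with the loop itself. Unlike plain text, an RTF document is a single braced group ({\rtf1 ... }), and a word processor stops reading at the closing brace of the first document, so everything appended after file1.rtf is ignored even though cat shows it. The files have to be merged structurally rather than byte-wise. On macOS (which the use of open suggests), textutil can do this, assuming it is available:

# macOS: merge the RTF documents structurally instead of byte-wise
textutil -cat rtf -output ../combined.rtf *.rtf

Otherwise, any tool that actually parses RTF (a word processor's insert-file function, for example) is needed; byte concatenation only works for formats that remain valid when simply joined.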

cat multiple files in multiple folders via a loop

I have a set of files which are fastq.gz inputs (e.g. R3003_C0UH3ACXX_ATTACTCG_L005_R1_001.fastq.gz). The RXXXX numbers are not sequential, but I have all the folder names, which begin with R, in a txt file, one per line. Each folder has a different number of *_R1 and *_R2 files. I would like to merge them, and I have run cat *_R1.fastq.gz and cat *_R2.fastq.gz individually to do this. But I have 500 samples, and it is impossible to run them all like this by going into each individual folder and doing cat. How do I visit each folder listed in the txt file and run cat to get a final RXXXX.R1.fastq.gz and RXXXX.R2.fastq.gz for each folder, with a loop?
My file name is R3.txt, which lists the RXXXX folders, and I tried
for i in cat R3 ; do cat $i*R1*.fastq.gz >$i\.R1.fastq.gz; done
But it doesn't do what I was hoping.
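
A hedged note on why the attempt fails: for i in cat R3 iterates over the two literal words cat and R3, not over the lines of the file; that would need command substitution, for i in $(cat R3.txt), or better, a while read loop. A sketch, assuming R3.txt holds one folder name per line and the script runs from the folders' parent directory:

while IFS= read -r dir; do
    # each glob expands to however many R1/R2 parts the folder holds
    cat "$dir"/*R1*.fastq.gz > "${dir}.R1.fastq.gz"
    cat "$dir"/*R2*.fastq.gz > "${dir}.R2.fastq.gz"
done < R3.txt

The outputs are written next to the folders rather than inside them, so a rerun will not pick up its own previous output in the globs.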

Searching through .txt files in different folders for a specific string

New to bash scripting here.
I have a folder (/month) which contains more folders (/month/jan, /month/feb, /month/mar etc.), and within these folders there are .txt files (Sales11.txt, Sales17.txt etc.). These text files contain staff ID numbers and their sales results as a percentage,
e.g. Sales11.txt contents are
20456 78
20512 46
20498 67
20645 88
I am looking to search through these .txt files for a staff ID number and, when it is found, to make a text file in the staff member's folder /staff/20512 (which already exists) named Jan.txt, or whichever month it occurred in. The contents of the Jan.txt file will be the name of the sales file and the percentage. There could be more than one sales event in each month.
Example output file would be to /staff/20512 the file named Jan.txt which would contain
Sales11 46
Sales17 98
I think I need to include an if statement and use an array to search through the different folders, and within this use grep to search for the staff ID.
I'm not 100% sure what order these should go in, or how to make use of multiple different arrays in a single script, if that is even possible. My first attempt is below.
while read STAFFID ; do
    ARRSTAFF=($STAFFID)
    ARRMON=($MONTHS)
    ARRSALE=($SALES)
    if [ grep -r "/month/${ARRMON}/${ARRSALE}.txt" -e "${ARRSTAFF}" ]; then
        echo "${ARRSALE[0]} ${ARRSALE[1]}" >> Staff/${ARRSTAFF[0]}/${ARRMON}.txt
    fi
done < contents/Staff.txt
Since there are no staff IDs in the data that fail to appear in Staff.txt, the latter provides no additional information and can be ignored. Every line in every sales file will correspond to one line in one of the generated per-staff files. It is simpler, then, to read each sales file just once, and to handle all its contents in that run.
Furthermore, it's unclear whether there is any particular advantage to building up arrays in memory. Doing so makes the script more complicated, and that's a loss unless you gain something substantial in return.
Here's one way you could approach the problem:
# keep one level of backups of existing target files
# (the targets live one level down, in /staff/ID/Month.txt)
for file in /staff/*/*.txt; do
    mv "${file}" "${file}.bak"
done

# Process the data files once each
for file in /month/*/*.txt; do
    # extract relevant filename parts
    month=$(basename "$(dirname "${file}")")
    filebase=$(basename "${file%.txt}")
    # read and report out all the lines of the file
    while read -r staffid value; do
        echo "$filebase" "$value" >> "/staff/${staffid}/${month}.txt"
    done < "$file"
done
That assumes primarily that the sales files are formatted as you describe, with no extra fields and no whitespace within the fields. It does not rely on Staff.txt at all.
