Process Hundreds of Input Files Using AWK - shell

I have 3 folders containing files as follows:
Folder1 contains only one file, called "data".
Folder2 contains more than a hundred files whose names start with "part1", all with the same text structure.
Folder3 contains more than a hundred files whose names start with "part2", all with the same text structure.
I've written an AWK program that takes as input the file from folder1, one file from folder2, and one file from folder3, and it works well.
Now I want to give the program all the files from all the folders as input. Therefore, I need a way to detect that the program has finished with the first two files (part1* + part2*) and will start to process the next ones, so that I can reset all the variables and arrays for the new processing.
The program will be run like this:
$ awkprogram folder1/data folder2/part1* folder3/part2*

Something like this maybe?
FNR==1 {          # for the first record of every input file
    filecounter++ # count how many files have been started
}
FNR==1 && filecounter > 2 { # once the first two files have been processed
    # reset variables and arrays here
}
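A minimal, runnable sketch of that idea; the sample files, the `seen` array, and the per-record counting below are illustrative stand-ins, not the asker's actual program:

```shell
# Three tiny sample files standing in for data, part1*, part2*:
printf 'a\nb\n' > f1; printf 'c\n' > f2; printf 'd\nd\n' > f3

awk '
FNR == 1 {                      # first record of every input file
    filecounter++               # how many files have been started
}
FNR == 1 && filecounter > 2 {   # a new file beyond the first two
    delete seen                 # reset arrays for the next batch
    n = 0                       # reset scalar accumulators
}
{ seen[$1]++; n++ }             # placeholder per-record work
END { printf "%d files, %d records in last batch\n", filecounter, n }
' f1 f2 f3
# prints: 3 files, 2 records in last batch
```

With GNU awk specifically, the BEGINFILE special pattern is an alternative, slightly cleaner place for such per-file resets.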

Related

for loop concatenating files that share part of their common basename (paired end sequencing reads)

I'm trying to concatenate a bunch of paired files into one file (for those who work with sequencing data, you'll be familiar with the paired-end read format).
For example, I have
SLG1.R1.fastq.gz
SLG1.R2.fastq.gz
SLG2.R1.fastq.gz
SLG2.R2.fastq.gz
SLG3.R1.fastq.gz
SLG3.R2.fastq.gz
etc.
I need to concatenate the two SLG1 files, the two SLG2 files, and the two SLG3 files.
So far I have this:
cd /workidr/slg/diet_manip/filtered_concatenated_reads/nonhost
for i in *1.fastq.gz
do
base=(basename $i "1.fastq.gz")
cat ${base}1.fastq.gz ${base}2.fastq.gz > /workdir/slg/diet_manip/filtered_concatenated_reads/cat/${base}.fastq.gz
done
The original files are all in the /filtered_concatenated_reads/nonhost directory, and I want the concatenated versions to be in /filtered_concatenated_reads/cat
The above code gives this error:
-bash: /workdir/slg/diet_manip/filtered_concatenated_reads/cat/basename.fastq.gz: No such file or directory
Any ideas?
Thank you!!
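The error comes from base=(basename $i "1.fastq.gz"): without $(...) command substitution, basename is never executed and base holds the literal word "basename". A corrected sketch, using short relative stand-ins for the question's directories and tiny illustrative input files:

```shell
# Illustrative stand-ins for .../nonhost (inputs) and .../cat (outputs):
mkdir -p nonhost cat
printf '@r1\n' > nonhost/SLG1.R1.fastq.gz
printf '@r2\n' > nonhost/SLG1.R2.fastq.gz

cd nonhost
for i in *1.fastq.gz
do
    base=$(basename "$i" "1.fastq.gz")   # SLG1.R1.fastq.gz -> SLG1.R
    cat "${base}1.fastq.gz" "${base}2.fastq.gz" > "../cat/${base}.fastq.gz"
done
cd ..
```

Plain cat is fine here, since concatenated gzip streams decompress as one file. Note that with the suffix "1.fastq.gz" the output keeps a trailing .R (SLG1.R.fastq.gz, as in the original script); strip "R1.fastq.gz" instead if SLG1.fastq.gz is wanted.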

Moving or copying files from one directory to multiple directories

I have a directory /input/source that contains files like:
$ ls -1 /input/source
FileName_1.txt
FileName_2.txt
FileName_3.txt
FileName_4.txt
I also have a file named /tmp/temp_path.txt that contains the following paths (or directories) information like:
$ cat /tmp/temp_path.txt
/output/dest/temp_1
/output/dest/temp_2
/output/dest/temp_3
/output/dest/temp_4
I am trying to build a shell script which will read the path info from /tmp/temp_path.txt and then move the files from the source folder to the respective directories as follows
/output/dest/temp_1/FileName_1.txt
/output/dest/temp_2/FileName_2.txt
/output/dest/temp_3/FileName_3.txt
/output/dest/temp_4/FileName_4.txt
I also wanted to do some checks before moving the files to the destination paths, like:
if count of dest path from file(/tmp/temp_path.txt) != count of input files from source folder
then move randomly
else
move sequentially
How can I do this?
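A sketch of the sequential branch only (the "move randomly" case is left out), assuming filenames contain no whitespace; the sample files and destination paths below are illustrative stand-ins for /input/source and /tmp/temp_path.txt:

```shell
#!/bin/sh
# Illustrative stand-ins for the question's paths:
src=source
list=temp_path.txt
mkdir -p "$src"
printf x > "$src/FileName_1.txt"
printf y > "$src/FileName_2.txt"
printf '%s\n%s\n' dest/temp_1 dest/temp_2 > "$list"

files=$(ls -1 "$src" | wc -l)
paths=$(wc -l < "$list")

if [ "$files" -eq "$paths" ]; then
    # Pair file N with path N: paste joins the two lists line by line.
    ls -1 "$src" | paste -d' ' - "$list" | while read -r f d; do
        mkdir -p "$d"
        mv "$src/$f" "$d/"
    done
else
    echo "count mismatch: $files files vs $paths paths" >&2
fi
```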

How to loop over multiple folders to concatenate FastQ files?

I have received multiple fastq.gz files from Illumina sequencing for 100 samples, but all the fastq.gz files for each sample are in a separate folder named after the sample ID. Moreover, I have multiple (8-16) R1.fastq.gz and R2.fastq.gz files for each sample. So I used the following code to concatenate all the R1.fastq.gz and R2.fastq.gz files into a single R1.fastq.gz and R2.fastq.gz:
cat V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz > sample_R1.fq.gz
The sequencing files are structured as in the code above: for each sample, the string starting with V has a different number, then L has a different number, and then there is another string of digits before the _1 and _2. The numbers keep changing from sample to sample.
My question is: how can I create a loop that goes over all the folders at once, takes the varying file numbering of the sequence files into account, and concatenates the multiple fq.gz files into a single R1 and R2 file per sample?
Surely, I cannot just concatenate them one by one by going into each sample folder.
Please give some helpful tips. Thank you.
The folder structure is the following:
/data/Sample_1/....._525_1_fq.gz /....._525_2_fq.gz /....._526_1_fq.gz /....._526_2_fq.gz
/data/Sample_2/....._580_1_fq.gz /....._580_2_fq.gz /....._589_1_fq.gz /....._589_2_fq.gz
/data/Sample_3/....._690_1_fq.gz /....._690_2_fq.gz /....._645_1_fq.gz /....._645_2_fq.gz
Below I have attached a screenshot of the folder structure.
Based on the provided file structure, would you please try:
#!/bin/bash
for d in Raw2/C*/; do
    (
        cd "$d" || exit          # leave the subshell if cd fails
        id=${d%/}; id=${id##*/}  # extract the sample ID from the directory name
        cat V*_1.fq.gz > "${id}_R1.fq.gz"
        cat V*_2.fq.gz > "${id}_R2.fq.gz"
    )
done
The syntax for d in Raw2/C*/ loops over the subdirectories of Raw2 whose names start with C.
The parentheses make the inner commands run in a subshell, so we don't have to care about returning from cd "$d" (at the expense of a small amount of extra execution time).
The variable id is assigned the ID extracted from the directory name.
cat V*_1.fq.gz, for example, will be expanded to V350028825_L04_581_1.fq.gz V350028825_L04_582_1.fq.gz V350028825_L04_583_1.fq.gz ... according to the files in the directory, and these are concatenated into ${id}_R1.fq.gz. The same goes for ${id}_R2.fq.gz.

cat multiple files in multiple folders via a loop

I have a set of fastq.gz input files (e.g. R3003_C0UH3ACXX_ATTACTCG_L005_R1_001.fastq.gz). The RXXXX numbers are not sequential, but I have all the folder names that begin with R in a txt file, one per line. Each folder has a different number of *_R1 and *_R2 files. I would like to merge them, and I have used cat *_R1.fastq.gz and cat *_R2.fastq.gz individually to do this. But I have 500 samples, and it is impossible to run them all like this by going into each individual folder and doing cat. How do I get a loop to open each folder listed in the txt file and run cat to produce the final RXXXX.R1.fastq.gz and RXXXX.R2.fastq.gz for each folder?
My file is named R3.txt, which lists the RXXXX folders, and I tried
for i in cat R3 ; do cat $i*R1*.fastq.gz >$i\.R1.fastq.gz; done
But it doesn't do what I was hoping.
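The attempt fails because for i in cat R3 iterates over the two literal words "cat" and "R3"; command substitution ($(cat R3.txt)) or, more robustly, a read loop is needed. A hedged sketch, run from the directory containing the sample folders, with illustrative stand-in data:

```shell
# Illustrative sample layout: one folder per sample, listed in R3.txt.
mkdir -p R3003
printf '@a\n' > R3003/R3003_C0UH3ACXX_ATTACTCG_L005_R1_001.fastq.gz
printf '@b\n' > R3003/R3003_C0UH3ACXX_ATTACTCG_L005_R2_001.fastq.gz
printf 'R3003\n' > R3.txt

while IFS= read -r dir; do
    cat "$dir"/*_R1_*.fastq.gz > "$dir/$dir.R1.fastq.gz"
    cat "$dir"/*_R2_*.fastq.gz > "$dir/$dir.R2.fastq.gz"
done < R3.txt
```

The output names contain .R1. rather than _R1_, so they never match the input globs on a re-run.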

Do multiple executable & independent zip files by files number or filesize

I need a shell script that basically does this:
Zip a folder, starting a new zip file every xxx files (or xxx MB).
I could not use the split method because I need all the zip files to be independent of each other.
So if I have a folder with 345 files, it will result in
zip-1.zip, zip-2.zip, zip-3.zip, zip-4.zip
I could not find a solution.
PS: there are no sub-folders.
zipsplit -n 1048576 files.zip
Creates independent zip files, each at most 1 MiB (1048576 bytes); for roughly 10 MB per archive, pass 10485760 instead.
The -n option unfortunately accepts only bytes - use e.g. http://www.whatsabyte.com/P1/byteconverter.htm
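Shell arithmetic can replace the online byte converter; a small sketch assuming a 10 MiB target size:

```shell
mb=10
bytes=$((mb * 1024 * 1024))   # MiB -> bytes
echo "$bytes"                 # prints 10485760
# then: zipsplit -n "$bytes" files.zip
```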
