How to loop over multiple folders to concatenate FastQ files? - bash

I have received multiple fastq.gz files from Illumina Sequencing for 100 samples. But all the fastq.gz files for the respective samples are in separate folders according to the sample ID. Moreover, I have multiple (8-16) R1.fastq.gz and R2.fastq.gz files for one sample. So, I used the following code for concatenating all the R1.fastq.gz and R2.fastq.gz into a single R1.fastq.gz and R2.fastq.gz.
cat V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz > sample_R1.fq.gz
The sequencing files are named as in the code above: a string starting with V followed by digits, then L with a lane number, then another run of digits before the _1 or _2 suffix. These numbers differ from sample to sample.
My question is: how can I write a loop that goes over all the sample folders at once, takes the varying file numbering into account, and concatenates the multiple fq.gz files into a single R1 and a single R2 file per sample?
Surely, I cannot just concatenate one by one by going into each sample folder.
Please give some helpful tips. Thank you.
The folder structure is the following:
/data/Sample_1/....._525_1_fq.gz /....._525_2_fq.gz /....._526_1_fq.gz /....._526_2_fq.gz
/data/Sample_2/....._580_1_fq.gz /....._580_2_fq.gz /....._589_1_fq.gz /....._589_2_fq.gz
/data/Sample_3/....._690_1_fq.gz /....._690_2_fq.gz /....._645_1_fq.gz /....._645_2_fq.gz
Below I have attached a screenshot of the folder structure.

Based on the provided file structure, would you please try:
#!/bin/bash
for d in Raw2/C*/; do
(
cd "$d" || exit 1 # leave the subshell if the directory cannot be entered
id=${d%/}; id=${id##*/} # extract ID from the directory name
cat V*_1.fq.gz > "${id}_R1.fq.gz"
cat V*_2.fq.gz > "${id}_R2.fq.gz"
)
done
The syntax for d in Raw2/C*/ loops over the subdirectories starting with C.
The parentheses run the inner commands in a subshell, so we don't have to worry about returning from cd "$d" (at the cost of a little extra execution time).
The variable id is assigned to the ID extracted from the directory name.
In cat V*_1.fq.gz, for example, the pattern is expanded to V350028825_L04_581_1.fq.gz V350028825_L04_582_1.fq.gz V350028825_L04_583_1.fq.gz ... according to the files in the directory, and those files are concatenated into ${id}_R1.fq.gz. The same applies to ${id}_R2.fq.gz.
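As a rough sketch, the same loop can be tried on a throwaway tree before touching real data. It assumes the /data/Sample_N layout from the question; the temporary directory, the file names and the file contents are made up for the demo. It also assumes bash, because it uses arrays and nullglob so that a folder without matching files is reported instead of producing an empty output file.

```shell
#!/bin/bash
# Demo tree with made-up file names; real data would sit under /data/Sample_*/
root=$(mktemp -d)
mkdir -p "$root/Sample_1" "$root/Sample_2"
printf 'r1-525\n' | gzip > "$root/Sample_1/V350043117_L04_525_1.fq.gz"
printf 'r1-526\n' | gzip > "$root/Sample_1/V350043117_L04_526_1.fq.gz"
printf 'r2-525\n' | gzip > "$root/Sample_1/V350043117_L04_525_2.fq.gz"
printf 'r2-526\n' | gzip > "$root/Sample_1/V350043117_L04_526_2.fq.gz"
printf 'r1-580\n' | gzip > "$root/Sample_2/V350028825_L03_580_1.fq.gz"
printf 'r2-580\n' | gzip > "$root/Sample_2/V350028825_L03_580_2.fq.gz"

shopt -s nullglob                    # an unmatched pattern expands to nothing
for d in "$root"/Sample_*/; do
  id=${d%/}; id=${id##*/}            # Sample_1, Sample_2, ...
  for r in 1 2; do
    files=("$d"V*_"$r".fq.gz)        # glob results are sorted by name
    if (( ${#files[@]} == 0 )); then
      echo "no _$r files in $d" >&2
      continue
    fi
    # gzip streams can be concatenated byte-for-byte
    cat "${files[@]}" > "$d${id}_R${r}.fq.gz"
  done
done
shopt -u nullglob
```

Because the output names start with the sample ID rather than V, rerunning the loop does not feed a previous result back into cat, as long as no sample ID itself starts with V.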

Related

for loop concatenating files that share part of their common basename (paired end sequencing reads)

I'm trying to concatenate a bunch of paired files into one file (for those who work with sequencing data, you'll be familiar with the paired-end read format).
For example, I have
SLG1.R1.fastq.gz
SLG1.R2.fastq.gz
SLG2.R1.fastq.gz
SLG2.R2.fastq.gz
SLG3.R1.fastq.gz
SLG3.R2.fastq.gz
etc.
I need to concatenate the two SLG1 files, the two SLG2 files, and the two SLG3 files.
So far I have this:
cd /workidr/slg/diet_manip/filtered_concatenated_reads/nonhost
for i in *1.fastq.gz
do
base=(basename $i "1.fastq.gz")
cat ${base}1.fastq.gz ${base}2.fastq.gz > /workdir/slg/diet_manip/filtered_concatenated_reads/cat/${base}.fastq.gz
done
The original files are all in the /filtered_concatenated_reads/nonhost directory, and I want the concatenated versions to be in /filtered_concatenated_reads/cat
The above code gives this error:
-bash: /workdir/slg/diet_manip/filtered_concatenated_reads/cat/basename.fastq.gz: No such file or directory
Any ideas?
Thank you!!
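The file name basename.fastq.gz in the error suggests that base=(basename ...) created a bash array whose first element is the literal word basename, instead of running the command. A minimal sketch with command substitution, $(...), might look like the following. The directories and file contents here are throwaway stand-ins, not the real paths from the question, and the output directory is created first because a missing directory would also produce "No such file or directory".

```shell
#!/bin/bash
# Throwaway input/output directories standing in for nonhost/ and cat/
in=$(mktemp -d)
out=$(mktemp -d)/cat
for s in SLG1 SLG2; do
  printf '%s-R1\n' "$s" | gzip > "$in/$s.R1.fastq.gz"
  printf '%s-R2\n' "$s" | gzip > "$in/$s.R2.fastq.gz"
done

mkdir -p "$out"                               # the target directory must exist
for r1 in "$in"/*.R1.fastq.gz; do
  base=$(basename "$r1" .R1.fastq.gz)         # $(...) runs basename; SLG1, SLG2, ...
  cat "$r1" "$in/$base.R2.fastq.gz" > "$out/$base.fastq.gz"
done
```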

Combine CSV files with condition

I need to combine all the CSV files in a directory (.csv), but only those for which the directory also contains a file with the same name and a different extension (.csv.done).
If a CSV file has no matching .done file, I don't need it in the combined output.
What is the best way to do this using Bash?
This approach is a solution to your problem. I see you've commented that it "didn't work", but whatever the reason is for it not working, it's likely simple to fix e.g. if you forgot to include key details, or failed to adapt it appropriately to suit your specific situation. If you need further help troubleshooting, add more info to your question.
The approach:
for f in *.csv.done
do
cat "${f%.*}" >> combined_file.csv
done
How it works:
In your example, you have 3 files named 1.csv 2.csv 3.csv and two 'done' files named 1.csv.done 2.csv.done.
This script begins by making a list of all files that end in .csv.done (two files: 1.csv.done 2.csv.done).
It then uses a parameter expansion, specifically ${parameter%word}, to 'shorten' the name of the two files in the list to .csv (instead of .csv.done).
Then it 'prints' the content of the two 'shortened' filenames (1.csv and 2.csv) into a 'combined' file.
It doesn't 'print' the content of 1.csv.done or 2.csv.done, or 3.csv, because these files weren't in the original 'list'.
If you run this script multiple times, it will keep adding the contents of files 1.csv and 2.csv to the 'combined' file (only run it once, or delete the 'combined' file before running it again)
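A variant that can be rerun safely could empty the combined file first and skip any .done marker whose CSV is missing. This is only a sketch on a temporary directory with made-up file contents; the output name combined_file.csv is taken from the loop above.

```shell
#!/bin/bash
# Temporary directory with made-up contents: 3 CSV files, 2 of them marked done
work=$(mktemp -d)
printf 'a\n' > "$work/1.csv"
printf 'b\n' > "$work/2.csv"
printf 'c\n' > "$work/3.csv"
: > "$work/1.csv.done"
: > "$work/2.csv.done"

out="$work/combined_file.csv"
: > "$out"                          # start from an empty file on every run
for f in "$work"/*.csv.done; do
  src=${f%.done}                    # 1.csv.done -> 1.csv
  if [ -f "$src" ]; then            # also skips the literal pattern if nothing matched
    cat "$src" >> "$out"
  fi
done
```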

How to find duplicate directories

Let's create a test directory tree:
#!/bin/bash
top="./testdir"
[[ -e "$top" ]] && { echo "$top already exists!" >&2; exit 1; }
mkfile() { printf "%s\n" "$(basename "$1")" > "$1"; }
mkdir -p "$top"/d1/d1{1,2}
mkdir -p "$top"/d2/d1some/d12copy
mkfile "$top/d1/d12/a"
mkfile "$top/d1/d12/b"
mkfile "$top/d2/d1some/d12copy/a"
mkfile "$top/d2/d1some/d12copy/b"
mkfile "$top/d2/x"
mkfile "$top/z"
The structure, as printed by find testdir \( -type d -printf "%p/\n" , -type f -print \), is:
testdir/
testdir/d1/
testdir/d1/d11/
testdir/d1/d12/
testdir/d1/d12/a
testdir/d1/d12/b
testdir/d2/
testdir/d2/d1some/
testdir/d2/d1some/d12copy/
testdir/d2/d1some/d12copy/a
testdir/d2/d1some/d12copy/b
testdir/d2/x
testdir/z
I need to find the duplicate directories, but considering only files (i.e. (sub)directories without files should be ignored). So, for the test tree above the wanted result is:
duplicate directories:
testdir/d1
testdir/d2/d1some
because both (sub)trees contain only two identical files, a and b (and several directories without files).
Of course, I could run md5deep -Zr ., or walk the whole tree with a Perl script (using File::Find+Digest::MD5, Path::Tiny or the like) and calculate the files' MD5 digests, but this doesn't help with finding the duplicate directories... :(
Any idea how to do this? Honestly, I have no idea.
EDIT
I don't need working code. (I'm able to code myself)
I "just" need some ideas "how to approach" the solution of the problem. :)
Edit2
The rationale behind this: I have approx. 2.5 TB of data copied from many external HDDs as a result of a wrong backup strategy. Over the years, whole $HOME dirs were copied onto (many different) external HDDs. Many subdirectories have the same content but live in different paths. So now I am trying to eliminate the same-content directories.
And I need to do this by directory, because there are directories that share some duplicate files, but not all. Let's say:
/some/path/project1/a
/some/path/project1/b
and
/some/path/project2/a
/some/path/project2/x
Here a is a duplicate file (not only by name, but by content too), but it is needed for both projects. So I want to keep a in both directories, even though the copies are duplicates. Therefore I'm looking for a "logic" for finding duplicate directories.
Some key points:
If I understand correctly (from your comment, where you said: "(Also, when me saying identical files I mean identical by their content, not by their name)"), you want to find duplicate directories, i.e. directories whose content is exactly the same as that of some other directory, regardless of the file names.
For this you must calculate a checksum or digest for each file. Identical digest = identical file (with high probability). :) As you already said, md5deep -Zr -of /top/dir is a good starting point.
I added -of because for such a job you don't want to digest the targets of symlinks or other special files such as FIFOs, just plain files.
Calculating the MD5 of every file in a 2.5 TB tree will surely take a few hours unless you have a very fast machine. md5deep runs a thread for each CPU core. So, while it runs, you can write some scripts.
Also, consider running md5deep with sudo, because it is frustrating to get error messages about unreadable files after a long run only because you forgot to change the file ownership... (Just a note) :) :)
For the "how to":
To compare "directories" you need to calculate a "directory-digest", which makes comparing and finding duplicates easy.
The most important thing is to realize the following key points:
You can exclude directories that contain files with unique digests. If a file is unique, i.e. has no duplicates, checking its directory is pointless: a unique file in a directory means the directory is unique too. So the script should ignore every directory that contains a file with a unique MD5 digest (from md5deep's output).
You don't need to calculate the "directory-digest" from the files themselves (as you are trying to do in your follow-up question). It is enough to calculate it from the already computed MD5 digests of the files; you just have to sort them first!
For example, if your directory /path/to/some contains only two files a and b, and
if file "a" has md5 : 0cc175b9c0f1b6a831c399e269772661
and file "b" has md5: 92eb5ffee6ae2fec3ad71c777531578f
you can calculate the "directory-digest" from the above file digests, e.g. using Digest::MD5:
perl -MDigest::MD5=md5_hex -E 'say md5_hex(sort qw( 92eb5ffee6ae2fec3ad71c777531578f 0cc175b9c0f1b6a831c399e269772661))'
and you will get 3bc22fb7aaebe9c8c5d7de312b876bb8 as your "directory-digest". The sort is crucial(!) here, because the same command without the sort:
perl -MDigest::MD5=md5_hex -E 'say md5_hex(qw( 92eb5ffee6ae2fec3ad71c777531578f 0cc175b9c0f1b6a831c399e269772661))'
produces: 3a13f2408f269db87ef0110a90e168ae.
Note that the above digests aren't digests of the file contents themselves, but they will differ for directories with different files and be the same for directories with identical files (because identical files have identical MD5 file digests). The sorting ensures that you always calculate the digest in the same order, e.g. if some other directory contains two files
file "aaa" has md5 : 92eb5ffee6ae2fec3ad71c777531578f
file "bbb" has md5 : 0cc175b9c0f1b6a831c399e269772661
using the above sort and md5 you will again get 3bc22fb7aaebe9c8c5d7de312b876bb8, i.e. a directory containing the same files as above.
In this way you can calculate a "directory-digest" for every directory you have, and you can be sure that if another directory has the digest 3bc22fb7aaebe9c8c5d7de312b876bb8, it contains exactly the two files a and b from above (even if their names are different).
This method is fast, because you calculate the "directory-digests" only from short 32-character strings, so you avoid computing the file digests multiple times.
The final part is easy now. Your final data should be in form:
3a13f2408f269db87ef0110a90e168ae /some/directory
16ea2389b5e62bc66b873e27072b0d20 /another/directory
3a13f2408f269db87ef0110a90e168ae /path/to/other/directory
From this it is easy to see that
/some/directory and /path/to/other/directory are identical, because they have identical "directory-digests".
Hm... All of the above is only a few lines of Perl. It would probably have been faster to write the Perl script here directly than the long textual answer above, but you said you don't want code... :) :)
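For illustration only, the directory-digest idea can be sketched in shell. This swaps md5deep and Perl for md5sum from GNU coreutils and relies on GNU find's -printf, so both are assumptions about the environment. It hashes the sorted file digests joined by newlines, so the values differ from the Perl example above. It looks only at files directly inside each directory and does not implement the unique-file shortcut. The test tree, file names and contents are made up.

```shell
#!/bin/bash
# Made-up test tree: d12 and d12copy hold the same contents under different names
top=$(mktemp -d)
mkdir -p "$top/d1/d12" "$top/d2/d1some/d12copy"
printf 'a\n' > "$top/d1/d12/a"
printf 'b\n' > "$top/d1/d12/b"
printf 'b\n' > "$top/d2/d1some/d12copy/bbb"
printf 'a\n' > "$top/d2/d1some/d12copy/aaa"
printf 'x\n' > "$top/d2/x"

# One line per directory that directly contains files: "<directory-digest>  <dir>"
dir_digests=$(
  find "$top" -type f -printf '%h\n' | sort -u |
  while IFS= read -r dir; do
    digest=$(find "$dir" -maxdepth 1 -type f -exec md5sum {} + |
             cut -d' ' -f1 |      # keep only the file digests
             sort |               # order must not depend on file names
             md5sum | cut -d' ' -f1)
    printf '%s  %s\n' "$digest" "$dir"
  done
)

# Keep only the lines whose directory-digest occurs more than once
dupes=$(printf '%s\n' "$dir_digests" | sort |
  awk '{ count[$1]++; line[NR] = $0; key[NR] = $1 }
       END { for (i = 1; i <= NR; i++) if (count[key[i]] > 1) print line[i] }')
printf '%s\n' "$dupes"
```

Paths containing newlines would break the line-based pipeline; for a real 2.5 TB tree the per-file digests would more sensibly be computed once and reused, as the answer suggests.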
A traversal can identify directories which are duplicates in the sense you describe. I take it that this is: if all files in a directory are equal to all files of another then their paths are duplicates.
Find all files in each directory and form a string with their names. You can concatenate the names with a comma, say (or some other sequence that is certainly not in any names). This is to be compared. Prepend the path to this string, so to identify directories.
Comparison can be done for instance by populating a hash with keys being strings with filenames and path their values. Once you find that a key already exists you can check the content of files, and add the path to the list of duplicates.
The strings with path don't have to be actually formed, as you can build the hash and dupes list during the traversal. Having the full list first allows for other kinds of accounting, if desired.
This is altogether very little code to write.
An example. Let's say that you have
dir1/subdir1/{a,b} # duplicates (files 'a' and 'b' are considered equal)
dir2/subdir2/{a,b}
and
proj1/subproj1/{a,b,X} # NOT duplicates, since there are different files
proj2/subproj2/{a,b,Y}
The above prescription would give you strings
'dir1/subdir1/a,b',
'dir2/subdir2/a,b',
'proj1/subproj1/a,b,X',
'proj2/subproj2/a,b,Y';
where the (sub)string 'a,b' identifies dir1/subdir1 and dir2/subdir2 as duplicates.
I don't see how you can avoid a traversal to build a system that accounts for all files.
The procedure above is the first step, not handling directories with files and subdirectories.
Consider
dirA/ dirB/
a b sdA/ a X sdB/
c d c d
Here the paths dirA/sdA/ and dirB/sdB/ are duplicates by the problem description but the whole dirA/ and dirB/ are distinct. This isn't shown in the question but I'd expect it to be of interest.
The procedure from the first part can be modified for this. Iterate through directories, forming a path component at every step. Get all files in each, and all subdirectories (if none we are done). Append the comma-separated file list to the path component (/sdA/). So the representation of the above is
'dirA/sdA,a,b/c,d', 'dirB/sdB,a,X/c,d'
For each file-list substring (c,d) found to already exist we can check its path against the existing one, component by component. A hash with keys like c,d won't do here, since this example has the same file list in distinct hierarchies, so a modified (or different) data structure is needed.
Finally, there may be more subdirectories parallel to sdA (say sdA2). We care only about its own path, except for the parallel files (a,b in that component of the path, dirA/sdA2,a,b/). So keep track of all bottom-level file lists (c,d) with their paths and, if the file lists are equal and the paths are of the same length, check whether the paths have equal a,b file lists in each path component.
I don't know whether this is a workable solution for you, but I'd expect "near-duplicates" to be rare: a backup is either a duplicate or not. So there may not be much need to handle further edge cases in complex sprawling hierarchies. This procedure should at least be a useful pre-selection mechanism that greatly reduces the need for further work.
This assumes that equal file-names very likely indicate equal files. A part of that is my expectation that if a file was even just renamed it still cannot be considered a duplicate. If this is not so this approach won't work and one would need something along the lines of the answer by jm666.
I made a tool that searches for duplicate folders.
https://github.com/un1t/dirdups
dirdups testdir -i 1
The -i 1 option treats folders as duplicates if they have at least 1 file in common. Without this option the default value is 10.
In your case it will find the following directories:
testdir/d1/d12/
testdir/d2/d1some/d12copy/

How to match numbering of files across different folders e.g. rename NAME9.txt to NAME00009.txt

I have a huge list of files, they came through different processes, so for some reason the ones in the first folder are numbered like this
A9.txt A1.txt while the ones in the other have A00009.txt A.00001.txt
I have no more than 99837 files, so at most four "extra" zeros are needed on one side.
I need to rename all the files inside one folder so that the names match. Is there any way to do this in a loop? Thanks for the help.
You should take a look at perl-rename (sometimes called rename), not to be confused with rename from util-linux.
perl-rename 's/\d+/sprintf("%05d", $&)/e' *.txt
The above script will rename all .txt files in a directory to the following:
A1.txt -> A00001.txt
A10.txt -> A00010.txt
Hello225.txt -> Hello00225.txt
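If perl-rename is not available, a plain bash loop can do the same zero-padding. This is a sketch under a few assumptions: each name consists of a non-digit prefix, one run of digits and the .txt extension, and the demo directory and file names are invented. The 10# prefix forces base 10 so that numbers with leading zeros are not read as octal.

```shell
#!/bin/bash
# Invented demo files in a temporary directory
dir=$(mktemp -d)
touch "$dir/A1.txt" "$dir/A9.txt" "$dir/A10.txt" "$dir/Hello225.txt"

re='^([^0-9]*)([0-9]+)\.txt$'       # prefix, digits, extension
for f in "$dir"/*.txt; do
  name=${f##*/}
  if [[ $name =~ $re ]]; then
    # Pad the number to 5 digits; 10# avoids octal interpretation
    printf -v new '%s%05d.txt' "${BASH_REMATCH[1]}" "$((10#${BASH_REMATCH[2]}))"
    if [[ $name != "$new" ]]; then
      mv -n -- "$f" "$dir/$new"     # -n: never overwrite an existing file
    fi
  fi
done
```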

Terminals - Creating Multiple Identical Folders within Subdirectories and Moving Files

I have a bunch of files I'm trying to organize quickly, and I had two questions about how to do that. I really appreciate any help! I tried searching but couldn't find anything on these specific commands for OSX.
First, I have about 100 folders in a directory, and I'd like to place a folder in each one of them.
For example, I have
Cars/Mercedes/
Cars/BMW/
Cars/Audi/
Cars/Jeep/
Cars/Tesla/
Is there a way I can create a folder inside each of those named "Pricing" in one command, i.e. ->
Cars/Mercedes/Pricing
Cars/BMW/Pricing
Cars/Audi/Pricing
Cars/Jeep/Pricing
Cars/Tesla/Pricing
My second question is a little tougher to explain. In each of these folders, I'd like move certain files into these newly created folders (above) in the subdirectory.
Each file has a slightly different filename but contains the same string of letters - for example, in each of the above folders, I might have
Cars/Mercedes/payment123.html
Cars/BMW/payment432.html
Cars/Audi/payment999.html
Cars/Jeep/payment283.html
Is there a way to search each subdirectory for a file containing the string "payment" and move that file into a subfolder in that subdirectory, i.e. into the hypothetical "Pricing" folders we just created above, with one command for all the subdirectories in Cars?
Thanks so much~! help with either of these would be invaluable.
I will assume you are using bash, since it is the default shell in OS X. One way to do this uses a for loop over each directory to create the subdirectory and move the file. Wildcards are used to find all of the directories and the file.
for DIR in Cars/*/ ; do
mkdir "${DIR}Pricing"
mv "${DIR}"payment*.html "${DIR}Pricing/"
done
The first line finds every directory in Cars and runs the loop once for each, setting DIR to the current directory (including its trailing slash). The second line creates the subdirectory using that substitution. Note the double quotes, which are needed only if the path could contain spaces. The third line moves any file in the directory whose name starts with "payment" and ends with ".html" into the subdirectory; the wildcard must stay outside the quotes, because a quoted * is passed to mv literally instead of being expanded by the shell. If several files match, they will all be moved. The fourth line simply marks the end of the loop.
If you are typing this directly into the command line, you can combine it into a single line:
for DIR in Cars/*/ ; do mkdir "${DIR}Pricing"; mv "${DIR}"payment*.html "${DIR}Pricing/"; done
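A slightly more defensive sketch of the same loop, run on a throwaway Cars tree with invented files, keeps the wildcard outside the quotes so the shell expands it and uses bash's nullglob so that folders without a payment file are simply skipped. mkdir -p is used so that rerunning it does not fail on existing Pricing folders.

```shell
#!/bin/bash
# Throwaway tree; Tesla deliberately has no payment file
root=$(mktemp -d)
mkdir -p "$root/Cars/Mercedes" "$root/Cars/BMW" "$root/Cars/Tesla"
touch "$root/Cars/Mercedes/payment123.html" "$root/Cars/BMW/payment432.html"

shopt -s nullglob                       # unmatched patterns expand to nothing
for dir in "$root"/Cars/*/; do
  mkdir -p "${dir}Pricing"
  files=("$dir"payment*.html)           # quote the path, not the wildcard
  if (( ${#files[@]} > 0 )); then
    mv -- "${files[@]}" "${dir}Pricing/"
  fi
done
shopt -u nullglob
```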
