How to merge files by keyword in filename? - bash

I have many files in a folder, such as:
4628_group_1
3643_group_0
7578_group_1
4684_group_0
Finally, I want to merge the files into 2 groups:
Group1.csv is merged from 4628_group_1 and 7578_group_1
Group0.csv is merged from 3643_group_0 and 4684_group_0

Depending on what you mean by merge, you may be able to achieve this with two simple cat commands.
cat *_group_0 > Group0.csv
cat *_group_1 > Group1.csv
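If the group files have no header rows, the two cat commands above are all you need. If each file starts with a header row (an assumption; the question doesn't say), a small awk filter can keep the header from the first file only:

awk 'FNR != 1 || NR == 1' *_group_1 > Group1.csv
awk 'FNR != 1 || NR == 1' *_group_0 > Group0.csv

FNR is the line number within the current file and NR the overall line number, so the condition prints every non-first line plus the very first line of the first file.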

Related

for loop concatenating files that share part of their common basename (paired end sequencing reads)

I'm trying to concatenate a bunch of paired files into one file (for those who work with sequencing data, you'll be familiar with the paired-end read format).
For example, I have
SLG1.R1.fastq.gz
SLG1.R2.fastq.gz
SLG2.R1.fastq.gz
SLG2.R2.fastq.gz
SLG3.R1.fastq.gz
SLG3.R2.fastq.gz
etc.
I need to concatenate the two SLG1 files, the two SLG2 files, and the two SLG3 files.
So far I have this:
cd /workidr/slg/diet_manip/filtered_concatenated_reads/nonhost
for i in *1.fastq.gz
do
base=(basename $i "1.fastq.gz")
cat ${base}1.fastq.gz ${base}2.fastq.gz > /workdir/slg/diet_manip/filtered_concatenated_reads/cat/${base}.fastq.gz
done
The original files are all in the /filtered_concatenated_reads/nonhost directory, and I want the concatenated versions to be in /filtered_concatenated_reads/cat
The above code gives this error:
-bash: /workdir/slg/diet_manip/filtered_concatenated_reads/cat/basename.fastq.gz: No such file or directory
Any ideas?
Thank you!!
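The error message contains the clue: base=(basename $i "1.fastq.gz") is an array assignment, so basename is stored as a literal word rather than executed, and ${base} expands to the text basename. Command substitution, $(...), is what runs the command. A minimal sketch of the corrected loop, keeping the question's layout (note the cd path is misspelled /workidr in the question while the error message says /workdir):

cd /workdir/slg/diet_manip/filtered_concatenated_reads/nonhost || exit
mkdir -p ../cat    # ensure the destination directory exists
for i in *1.fastq.gz
do
    base=$(basename "$i" "1.fastq.gz")    # SLG1.R1.fastq.gz -> SLG1.R
    cat "${base}1.fastq.gz" "${base}2.fastq.gz" > ../cat/"${base}.fastq.gz"
done

Concatenating the .gz files directly is fine here: a file made of several gzip members decompresses as a single stream.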

How to loop over multiple folders to concatenate FastQ files?

I have received multiple fastq.gz files from Illumina sequencing for 100 samples, but all the fastq.gz files for each sample are in a separate folder named after the sample ID. Moreover, I have multiple (8-16) R1.fastq.gz and R2.fastq.gz files per sample. So I used the following code to concatenate all the R1.fastq.gz and R2.fastq.gz files into a single R1.fastq.gz and R2.fastq.gz:
cat V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-526_1.fq.gz V350043117_L04_some_digits-527_1.fq.gz > sample_R1.fq.gz
So the sequencing file names have the structure shown in the code above. For each sample, the string starting with V has a different number, then L has a different number, and then there is another string of digits before the _1 and _2. The numbers keep changing from sample to sample.
My question is: how can I create a loop that goes over all the folders at once, takes the different numbering of the sequence files into account, and concatenates the multiple fq.gz files into a single R1 and a single R2 file per sample?
Surely I cannot just concatenate them one by one by going into each sample folder.
Please give some helpful tips. Thank you.
The folder structure is the following:
/data/Sample_1/....._525_1.fq.gz /....._525_2.fq.gz /....._526_1.fq.gz /....._526_2.fq.gz
/data/Sample_2/....._580_1.fq.gz /....._580_2.fq.gz /....._589_1.fq.gz /....._589_2.fq.gz
/data/Sample_3/....._690_1.fq.gz /....._690_2.fq.gz /....._645_1.fq.gz /....._645_2.fq.gz
Below I have attached a screenshot of the folder structure.
[Screenshot: folder structure]
Based on the provided file structure, would you please try:
#!/bin/bash
for d in Raw2/C*/; do
    (
        cd "$d" || exit
        id=${d%/}; id=${id##*/}    # extract the sample ID from the directory name
        cat V*_1.fq.gz > "${id}_R1.fq.gz"
        cat V*_2.fq.gz > "${id}_R2.fq.gz"
    )
done
The syntax for d in Raw2/C*/ loops over the subdirectories starting with C.
The parentheses run the inner commands in a subshell, so we don't have to worry about returning from cd "$d" (at the expense of a small amount of extra execution time); the || exit aborts the subshell if the directory can't be entered.
The variable id is assigned the ID extracted from the directory name.
cat V*_1.fq.gz, for example, expands to V350028825_L04_581_1.fq.gz V350028825_L04_582_1.fq.gz V350028825_L04_583_1.fq.gz ... according to the files in the directory, and those files are concatenated into ${id}_R1.fq.gz. Same for ${id}_R2.fq.gz.
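If the layout is the /data/Sample_*/ structure described in the question text, the same idea works without the subshell by keeping the directory prefix on every path. A minimal sketch, assuming the files end in .fq.gz as in the cat example:

for d in /data/Sample_*/; do
    id=${d%/}; id=${id##*/}                  # e.g. Sample_1
    cat "$d"*_1.fq.gz > "$d${id}_R1.fq.gz"   # all forward reads for this sample
    cat "$d"*_2.fq.gz > "$d${id}_R2.fq.gz"   # all reverse reads
done

The output names end in R1.fq.gz/R2.fq.gz rather than _1.fq.gz/_2.fq.gz, so the globs will not pick them up if the loop is run again.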

Combine CSV files with condition

I need to combine all the csv files in some directory (.csv), provided that for each one there is another file with the same name in the directory but with a different extension (.csv.done).
If a csv file doesn't have a matching .csv.done file, then I don't need it in the combine process.
What is the best way to do this using Bash?
This approach is a solution to your problem. I see you've commented that it "didn't work", but whatever the reason for it not working, it's likely simple to fix, e.g. you forgot to include key details, or failed to adapt it appropriately to suit your specific situation. If you need further help troubleshooting, add more info to your question.
The approach:
for f in *.csv.done
do
    cat "${f%.*}" >> combined_file.csv
done
How it works:
In your example, you have 3 files named 1.csv 2.csv 3.csv and two 'done' files named 1.csv.done 2.csv.done.
This script begins by making a list of all files that end in .csv.done (two files: 1.csv.done 2.csv.done).
It then uses a parameter expansion, specifically ${parameter%word}, to 'shorten' the name of the two files in the list to .csv (instead of .csv.done).
Then it 'prints' the content of the two 'shortened' filenames (1.csv and 2.csv) into a 'combined' file.
It doesn't 'print' the content of 1.csv.done or 2.csv.done, or 3.csv, because these files weren't in the original 'list'.
If you run this script multiple times, it will keep appending the contents of files 1.csv and 2.csv to the 'combined' file (so only run it once, or delete the 'combined' file before running it again).
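The same idea can be made a bit more defensive. This is a sketch, not part of the original answer: nullglob makes the loop body simply not run when no .csv.done files exist, and the existence test skips a .done marker whose csv has been deleted:

shopt -s nullglob                  # an unmatched glob expands to nothing, not to itself
for f in *.csv.done
do
    csv=${f%.done}                 # 1.csv.done -> 1.csv
    [ -e "$csv" ] && cat "$csv" >> combined_file.csv
done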

How to match numbering of files across different folders e.g. rename NAME9.txt to NAME00009.txt

I have a huge list of files; they came through different processes, so for some reason the ones in the first folder are numbered like
A9.txt A1.txt while the ones in the other folder have A00009.txt A00001.txt
I have no more than 99837 files, so there are at most four "extra" 0s on one side.
I need to rename all the files inside one folder so the names match. Is there any way to do this in a loop? Thanks for the help.
You should take a look at perl-rename (sometimes called rename), not to be confused with rename from util-linux.
perl-rename 's/\d+/sprintf("%05d", $&)/e' *.txt
The above script will rename all .txt files in a directory to the following:
A1.txt -> A00001.txt
A10.txt -> A00010.txt
Hello225.txt -> Hello00225.txt
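If perl-rename isn't installed, a plain-bash loop can do the same zero-padding. This is a sketch under the assumption that each name is a non-numeric prefix followed by one run of digits and .txt, as in the examples above:

for f in *.txt; do
    if [[ $f =~ ^([^0-9]*)([0-9]+)\.txt$ ]]; then
        # 10# forces base 10, so numbers like 09 aren't treated as octal
        printf -v new '%s%05d.txt' "${BASH_REMATCH[1]}" "$((10#${BASH_REMATCH[2]}))"
        [ "$f" = "$new" ] || mv -- "$f" "$new"
    fi
done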

diff of identical tar archives returns that they are not identical

I have a script that generates a tar archive using the command
tar -zacf /tmp/foo.tar.gz /home/yotam/foo
it then checks if a tar file is already in a certain folder, and checks if there are any changes between the two archives; if so, it keeps the new one
if ! [ -e /home/yotam/bar/foo.tar.gz ]; then
    cp /tmp/foo.tar.gz /home/yotam/bar/foo.tar.gz
    cond=1
else
    # compare
    diff --brief <(sort /tmp/foo.tar.gz) <(sort /home/yotam/bar/foo.tar.gz) >/dev/null
    cond=$?
fi
if [ $cond -eq 1 ]; then
    rm /home/yotam/bar/foo.tar.gz
    cp /tmp/foo.tar.gz /home/yotam/bar/foo.tar.gz
fi
However, this script always views the two archive files as different, even if I haven't changed anything in either archive or in the foo folder itself. What is wrong with my check?
Edit:
for what it's worth, replacing the diff line with
diff --brief /tmp/foo.tar.gz /home/yotam/bar/foo.tar.gz >/dev/null
yields the same result.
I'm not sure that a gzip archive can be used as a hash function. Perhaps the gzip implementation relies on the current date and time and so produces different output on each execution.
I'd recommend using some widely used hash function instead. Take a look at git's internal hash implementation - shasum, for example.
More at: How does git compute file hashes?
It looks like you're doing a line-wise compare of gzipped tar archives, after sorting the lines. There are multiple reasons why this is a bad idea (for one: sorting something that is gzipped line by line doesn't make sense). To check whether 2 files are identical, either use diff file1 file2, or calculate a hash for each file (with md5/md5sum filename) and compare those.
The problem is that gzip stores the name of the file it compresses inside the gzip archive. If you have 2 identical files and gzip them, you will get 2 different archives.
So what can you do to solve this? For one, you can compare gunzipped versions of both files: diff <(gzcat out/out2.tar.gz) <(gzcat out2.tar.gz). I assume you have the sort in there in case the files get tarred in a different order, but I don't think you have to worry about that. If it is a problem for you, check out something like tarsum. That will give you a better result, since with sort you would not notice a line moving from one file to the other, or two lines being swapped within a file.
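Putting that advice to work on the paths from the question, one option is to compare the decompressed streams with cmp and keep the new archive only when they differ. A sketch, not taken verbatim from any answer above (zcat is assumed; it is called gzcat on some systems):

new=/tmp/foo.tar.gz
old=/home/yotam/bar/foo.tar.gz
if ! [ -e "$old" ]; then
    cp "$new" "$old"
elif ! cmp -s <(zcat "$new") <(zcat "$old"); then
    # the uncompressed tar streams differ, so keep the new archive
    cp "$new" "$old"
fi

Alternatively, creating the archive with tar -cf - /home/yotam/foo | gzip -n > /tmp/foo.tar.gz should make repeated runs byte-identical as long as nothing in the tree has changed, since gzip -n omits the original name and timestamp from the header; plain cmp on the .gz files is then enough.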
