I have a set of folders containing fastq.gz files (e.g. R3003_C0UH3ACXX_ATTACTCG_L005_R1_001.fastq.gz). The RXXXX numbers are not sequential, but I have all the folder names, which begin with R, in a text file, one per line. Each folder has a different number of *_R1 and *_R2 files. I would like to merge them, and so far I have run cat on the *_R1 and *_R2 fastq.gz files of one folder at a time. But I have 500 samples, and it is impractical to go into each folder and run cat by hand. How can I loop over the folders listed in the text file and run cat to produce the final RXXXX.R1.fastq.gz and RXXXX.R2.fastq.gz for each folder?
The file that lists the RXXXX folders is named R3.txt, and I tried:
for i in cat R3 ; do cat $i*R1*.fastq.gz >$i\.R1.fastq.gz; done
But it doesn't do what I was hoping for.
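One possible way to write such a loop is sketched below. It assumes the script is run from the directory that contains the RXXXX folders, that R3.txt holds one folder name per line, and that the read number appears in the file names as _R1_ or _R2_, as in the example file name above. The function name merge_lanes is made up for illustration. Plain cat is enough here because concatenated gzip files are themselves a valid gzip file.

```shell
#!/bin/bash
# merge_lanes LIST (hypothetical helper): for each folder named in LIST
# (one name per line), concatenate its *_R1_*.fastq.gz files into
# <folder>.R1.fastq.gz and its *_R2_*.fastq.gz files into <folder>.R2.fastq.gz.
# The merged files are written next to the folders, not inside them.
merge_lanes() {
    local list=$1 dir read_no
    while IFS= read -r dir; do
        [ -d "$dir" ] || continue      # skip blank lines and missing folders
        for read_no in R1 R2; do
            # The glob expands in sorted order, so R1 and R2 are merged
            # in the same lane order.
            cat "$dir"/*_"$read_no"_*.fastq.gz > "${dir}.${read_no}.fastq.gz"
        done
    done < "$list"
}
```

It would be called as merge_lanes R3.txt. Note that in the attempt above, for i in cat R3 loops over the two literal words cat and R3 rather than over the lines of the file.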
I'm trying to concatenate a bunch of paired files into one file (for those who work with sequencing data, you'll be familiar with the paired-end read format).
For example, I have
SLG1.R1.fastq.gz
SLG1.R2.fastq.gz
SLG2.R1.fastq.gz
SLG2.R2.fastq.gz
SLG3.R1.fastq.gz
SLG3.R2.fastq.gz
etc.
I need to concatenate the two SLG1 files, the two SLG2 files, and the two SLG3 files.
So far I have this:
cd /workidr/slg/diet_manip/filtered_concatenated_reads/nonhost
for i in *1.fastq.gz
do
base=(basename $i "1.fastq.gz")
cat ${base}1.fastq.gz ${base}2.fastq.gz > /workdir/slg/diet_manip/filtered_concatenated_reads/cat/${base}.fastq.gz
done
The original files are all in the /filtered_concatenated_reads/nonhost directory, and I want the concatenated versions to be in /filtered_concatenated_reads/cat
The above code gives this error:
-bash: /workdir/slg/diet_manip/filtered_concatenated_reads/cat/basename.fastq.gz: No such file or directory
Any ideas?
Thank you!!
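A likely cause, sketched here under a few assumptions: base=(basename ...) with plain parentheses builds a bash array whose first element is the literal word basename, so ${base} expands to "basename", which matches the file name in the error message. Command substitution needs $(...). The error itself is reported for the redirect target, which suggests the output directory may not exist yet, so the sketch creates it. The function name concat_pairs is hypothetical, and the .R1/.R2 naming is taken from the file list in the question.

```shell
#!/bin/bash
# concat_pairs IN_DIR OUT_DIR (hypothetical helper): for every
# <sample>.R1.fastq.gz in IN_DIR, write R1 followed by R2 to
# OUT_DIR/<sample>.fastq.gz.
concat_pairs() {
    local in_dir=$1 out_dir=$2 r1 sample
    mkdir -p "$out_dir"                 # a redirect into a missing directory fails
    for r1 in "$in_dir"/*.R1.fastq.gz; do
        [ -e "$r1" ] || continue        # the glob matched nothing
        # $(...) runs basename and captures its output
        sample=$(basename "$r1" .R1.fastq.gz)
        cat "$r1" "$in_dir/$sample.R2.fastq.gz" > "$out_dir/$sample.fastq.gz"
    done
}
```

It would be called with the nonhost directory as the first argument and the cat directory as the second.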
I have a problem: I used "Everything" to list every txt file I need from a specific directory so that I can merge them, but in EmEditor I can't find a way to merge files from a list of locations.
Here is what the Everything list looks like:
E:\Main directory\subdirectory 1\file.txt
E:\Main directory\subdirectory 2\file.txt
E:\Main directory\subdirectory 3\file.txt
E:\Main directory\subdirectory 4\file.txt
The list has over 40k locations. Is there a program that can read all the locations in the text file and combine the files?
Also, the subdirectories contain other txt files that I don't want, so I can't just merge every txt file under the main directory. Another thing is that there are variations of "file.txt", such as "Files.txt".
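If a bash shell is available, reading the list and concatenating the listed files could look like the sketch below. It assumes the shell can open the paths exactly as they are written in the list, which depends on the environment. The function name merge_from_list and the file names list.txt and merged.txt are made up for illustration. Because only listed files are read, other txt files in the same subdirectories are left out, and name variations such as Files.txt are handled as long as they appear in the list.

```shell
#!/bin/bash
# merge_from_list LIST OUT (hypothetical helper): append the content of every
# file named in LIST (one path per line) to OUT, in list order.
merge_from_list() {
    local list=$1 out=$2 path
    : > "$out"                              # start from an empty output file
    while IFS= read -r path || [ -n "$path" ]; do
        path=${path%$'\r'}                  # drop a Windows line ending, if any
        if [ ! -f "$path" ]; then
            printf 'skipping: %s\n' "$path" >&2
            continue
        fi
        cat -- "$path" >> "$out"
    done < "$list"
}
```

It would be called as merge_from_list list.txt merged.txt.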
I have received multiple fastq.gz files from Illumina Sequencing for 100 samples. But all the fastq.gz files for the respective samples are in separate folders according to the sample ID. Moreover, I have multiple (8-16) R1.fastq.gz and R2.fastq.gz files for one sample. So, I used the following code for concatenating all the R1.fastq.gz and R2.fastq.gz into a single R1.fastq.gz and R2.fastq.gz.
cat V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz > sample_R1.fq.gz
The sequencing file names are structured as in the command above. For each sample, the string starting with V is followed by a number, then L with another number, and then another string of digits before the _1 or _2. These numbers change from sample to sample.
My question is: how can I create a loop that goes over all the folders at once, takes the differing file numbering into account, and concatenates the multiple fq.gz files of each sample into a single R1 file and a single R2 file?
Surely, I cannot just concatenate one by one by going into each sample folder.
Please give some helpful tips. Thank you.
The folder structure is the following:
/data/Sample_1/....._525_1_fq.gz /....._525_2_fq.gz /....._526_1_fq.gz /....._526_2_fq.gz
/data/Sample_2/....._580_1_fq.gz /....._580_2_fq.gz /....._589_1_fq.gz /....._589_2_fq.gz
/data/Sample_3/....._690_1_fq.gz /....._690_2_fq.gz /....._645_1_fq.gz /....._645_2_fq.gz
Below I have attached a screenshot of the folder structure.
Based on the provided file structure, would you please try:
#!/bin/bash
for d in Raw2/C*/; do
    (
        cd "$d" || exit                     # skip this directory if cd fails
        id=${d%/}; id=${id##*/}             # extract ID from the directory name
        cat V*_1.fq.gz > "${id}_R1.fq.gz"
        cat V*_2.fq.gz > "${id}_R2.fq.gz"
    )
done
The syntax for d in Raw2/C*/ loops over the subdirectories starting with C.
The parentheses run the inner commands in a subshell, so we don't have to care about returning from cd "$d" (at the expense of a small amount of extra execution time).
The variable id is assigned the ID extracted from the directory name.
cat V*_1.fq.gz, for example, will be expanded to cat V350028825_L04_581_1.fq.gz V350028825_L04_582_1.fq.gz V350028825_L04_583_1.fq.gz ... according to the files in the directory, and these are concatenated into ${id}_R1.fq.gz. The same goes for ${id}_R2.fq.gz.
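Since each sample has a varying number of files, it may be worth checking that a directory holds as many _1 files as _2 files before merging. The sketch below is one way to do that, not part of the answer above. The function name merge_sample is hypothetical, and the V*_1.fq.gz and V*_2.fq.gz patterns are taken from the loop above.

```shell
#!/bin/bash
# merge_sample DIR (hypothetical helper): merge DIR/V*_1.fq.gz and
# DIR/V*_2.fq.gz into DIR/<id>_R1.fq.gz and DIR/<id>_R2.fq.gz, but only if
# both reads have the same, non-zero number of files.
# The body is in parentheses, so it runs in a subshell and the nullglob
# setting does not leak into the calling shell.
merge_sample() (
    shopt -s nullglob                   # an unmatched glob expands to nothing
    dir=${1%/}
    id=${dir##*/}                       # sample ID is the directory name
    r1=("$dir"/V*_1.fq.gz)
    r2=("$dir"/V*_2.fq.gz)
    if [ "${#r1[@]}" -eq 0 ] || [ "${#r1[@]}" -ne "${#r2[@]}" ]; then
        printf '%s: %d R1 and %d R2 files, skipped\n' \
            "$id" "${#r1[@]}" "${#r2[@]}" >&2
        exit 1
    fi
    cat "${r1[@]}" > "$dir/${id}_R1.fq.gz"
    cat "${r2[@]}" > "$dir/${id}_R2.fq.gz"
)
```

It could be driven by the same loop as above: for d in Raw2/C*/; do merge_sample "$d"; done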
I need to combine all the CSV files (.csv) in a directory, but only those for which the directory also contains a file with the same name and an additional .done extension (.csv.done).
If a CSV file has no matching .done file, I don't need it in the combined file.
What is the best way to do this in Bash?
This approach is a solution to your problem. I see you've commented that it "didn't work", but whatever the reason is for it not working, it's likely simple to fix e.g. if you forgot to include key details, or failed to adapt it appropriately to suit your specific situation. If you need further help troubleshooting, add more info to your question.
The approach:
for f in *.csv.done
do
    cat "${f%.*}" >> combined_file.csv
done
How it works:
In your example, you have 3 files named 1.csv 2.csv 3.csv and two 'done' files named 1.csv.done 2.csv.done.
This script begins by making a list of all files that end in .csv.done (two files: 1.csv.done 2.csv.done).
It then uses a parameter expansion, specifically ${parameter%word}, to 'shorten' the name of the two files in the list to .csv (instead of .csv.done).
Then it 'prints' the content of the two 'shortened' filenames (1.csv and 2.csv) into a 'combined' file.
It doesn't 'print' the content of 1.csv.done or 2.csv.done, or 3.csv, because these files weren't in the original 'list'.
If you run this script multiple times, it will keep appending the contents of 1.csv and 2.csv to the 'combined' file, so run it only once, or delete the 'combined' file before running it again.
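A variant that can be run repeatedly is sketched below: it empties the combined file first and skips any .done marker whose CSV is missing. The function name combine_done is made up for illustration, and the sketch assumes the combined file itself has no matching .done marker in the directory.

```shell
#!/bin/bash
# combine_done OUT (hypothetical helper): write the content of every X.csv
# that has a matching X.csv.done marker in the current directory to OUT.
combine_done() {
    local out=$1 marker csv
    : > "$out"                          # truncate, so reruns do not duplicate rows
    for marker in *.csv.done; do
        csv=${marker%.done}             # 1.csv.done -> 1.csv
        [ -f "$csv" ] || continue       # marker without a CSV, or no markers at all
        cat "$csv" >> "$out"
    done
}
```

It would be called as combine_done combined_file.csv from inside the directory that holds the CSV files.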
I'm new to Bash and I have a list of directory names stored in an Excel file. I'd like to find those directories (they are in different locations on the computer) and copy specific files from each one (four files with specific endings) to a remote computer.
For example:
For a directory name in the Excel sheet such as "NA123", I'd like to find the directory and copy part of its content to the remote computer, for example the files samples-sheet.csv, toInfo.xml, newfiles.gz and todo.csv, into a folder named "NA123".
How do I begin to do that?
****Editing to give an example of how it needs to be*****
A short example of the csv is as below:
A
1 14RD00129_TS1_01
2 SD-2015-06_01
3 US-005
4 RA99
All the names in the CSV are directories located under /home/bella/samples, in one of 3 folders: some are in /home/bella/samples/gruop_1, some in /home/bella/samples/gruop_2, and some in /home/bella/samples/gruop_3.
So first I need to iterate through the CSV file and locate the matching directory on my computer, then copy 4 specific files to a directory of the same name on the remote computer. I hope this is clearer.
I guess your CSV file should then consist only of directory names, since there's only one column. I assume there is no header line in the CSV (A in your example) and no line numbers. You can take this as a starting point:
samples='/home/bella/samples'
while IFS= read -r line; do
    # look for a directory with this name under the three group folders
    dir=$(find "$samples"/gruop_{1..3} -type d -name "$line")
    scp "$dir"/{samples-sheet.csv,toInfo.xml,newfiles.gz,todo.csv} \
        user@host.com:"/path/to/$line"
done < 'file.csv'
Basically, you could do something like:
# create the directory on the remote:
ssh remote-ip 'mkdir -p NA123'
# copy the files to the remote in the directory just created
for f in samples-sheet.csv toInfo.xml newfiles.gz todo.csv; do scp "$f" remote-ip:NA123/; done
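The two suggestions above can be combined into one loop, sketched here under several assumptions: each sample directory sits directly inside gruop_1, gruop_2 or gruop_3, the names contain no spaces or shell metacharacters, and the remote (user@host.com is a placeholder) accepts ssh and scp without prompting. The function names find_sample_dir and copy_sample are hypothetical.

```shell
#!/bin/bash
# find_sample_dir ROOT NAME (hypothetical helper): print the path of NAME if
# it is a directory directly inside ROOT/gruop_1, gruop_2 or gruop_3.
find_sample_dir() {
    local root=$1 name=$2 group
    for group in gruop_1 gruop_2 gruop_3; do
        if [ -d "$root/$group/$name" ]; then
            printf '%s\n' "$root/$group/$name"
            return 0
        fi
    done
    return 1                            # not found in any group folder
}

# copy_sample DIR NAME REMOTE (hypothetical helper): create NAME on REMOTE and
# copy the four files into it. -n and < /dev/null keep ssh and scp from
# reading the list that feeds the surrounding loop.
copy_sample() {
    local dir=$1 name=$2 remote=$3
    ssh -n "$remote" mkdir -p -- "$name" &&
        scp "$dir"/{samples-sheet.csv,toInfo.xml,newfiles.gz,todo.csv} \
            "$remote:$name/" < /dev/null
}

# Usage sketch, one directory name per line in file.csv:
# while IFS= read -r name; do
#     dir=$(find_sample_dir /home/bella/samples "$name") || continue
#     copy_sample "$dir" "$name" user@host.com
# done < file.csv
```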