I have a directory with many fq.gz files. I want to loop over the filenames and concatenate any files that share the same partial ID. For example, out of the 1000 files in the directory, these five need to be concatenated into a single file (they share the same ID from "L1" onwards):
141016-FC012-L1-N707-S504--123V_pre--Hs--R1.fq.gz
141031-FC01229-L1-N707-S504--123V_pre--Hs--R1.fq.gz
141020-FC01209-L1-N707-S504--123V_pre--Hs--R1.fq.gz
141027-FC013-L1-N707-S504--123V_pre--Hs--R1.fq.gz
141023-FC01219-L1-N707-S504--123V_pre--Hs--R1.fq.gz
Can anyone help?
Probably not the best way, but this might do what you need:
# Emit all *.fq.gz names NUL-delimited, keep everything from the 3rd
# '-'-separated field onward as the shared ID, de-duplicate, then
# concatenate every file ending in each ID (cut -z and sort -z are GNU).
while IFS= read -r -d '' id; do
    cat ./*"$id" > "/some/location/${id%.fq.gz}_grouped.fq.gz"
done < <(printf '%s\0' *.fq.gz | cut -zd- -f3- | sort -uz)
This will create files with the following format:
<ID>_grouped.fq.gz
L1-N707-S504--123V_pre--Hs--R1_grouped.fq.gz
...
...
I have 16 fastq files under different directories, and I need to produce a readlength.tsv for each of them separately. This is the script I use to produce a readlength.tsv:
zcat ~/proje/project/name/fıle_fastq | paste - - - - | cut -f1,2 | while read readID sequ;
do
len=`echo $sequ | wc -m`
echo -e "$readID\t$len"
done > ~/project/name/fıle1_readlength.tsv
One by one I can produce these read lengths, but it will take a long time. I want to produce them all at once, so I created a list of these fastq files, but I couldn't write a loop that produces the readlength.tsv output from all 16 fastq files at once.
I would appreciate it if you could help me.
Assuming a file list.txt contains the 16 file paths such as:
~/proje/project/name/file1_fastq
~/proje/project/name/file2_fastq
..
~/path/to/the/fastq_file16
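If you don't have list.txt yet, something like this could generate it (the path and the *_fastq name pattern are just taken from your example; adjust as needed). Listing expanded paths via find also avoids the pitfall that a literal leading ~ inside list.txt is not tilde-expanded by the loop below:
find ~/proje/project/name -type f -name '*_fastq' > list.txt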
Then would you please try:
#!/bin/bash
while IFS= read -r f; do              # "f" is each fastq filename in "list.txt"
    mapfile -t ary < <(zcat "$f")     # read the decompressed lines into the array "ary"
    for ((i = 0; i < ${#ary[@]}; i += 4)); do
        # each fastq record is 4 lines: ary[i] is the id, ary[i+1] the sequence
        echo -e "${ary[i]}\t${#ary[i+1]}"
    done
done < list.txt > readlength.tsv
As the fastq format stores the id on the 1st line and the sequence on the 2nd line of each four-line record, the bash built-in mapfile handles them well. Note that ${#ary[i+1]} gives the sequence length directly; your original echo $sequ | wc -m also counts the trailing newline, so it reports one more than the actual length.
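If, instead, you want a separate readlength.tsv per input file, as your original script produced, you can redirect inside the loop; a sketch, assuming the filenames end in _fastq as in your example:
while IFS= read -r f; do
    mapfile -t ary < <(zcat "$f")
    for ((i = 0; i < ${#ary[@]}; i += 4)); do
        echo -e "${ary[i]}\t${#ary[i+1]}"
    done > "${f%_fastq}_readlength.tsv"   # one output file next to each fastq
done < list.txt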
As a side note, the letter ı in your code (e.g. in fıle_fastq) is a non-ASCII character, the Turkish dotless i; make sure the paths in list.txt match your actual filenames exactly.
How do I merge PDF files that share the same name prefix?
poetry_2.pdf
poetry_3.pdf
poetry_4.pdf
metaphysics_2.pdf
metaphysics_3.pdf
I am looking for:
poetry.pdf
metaphysics.pdf
This loop, meant to find the PDF files and merge them with pdfunite, failed:
for file1 in *_02.pdf ; do
# get your second_ files
file2=${file1/_02.pdf/_03.pdf}
# merge them together
pdfunite $file1 $file2 $file1.pdf
done
First, you need a list of prefixes (e.g. poetry, metaphysics). Then, iterate over that list and unite prefix_*.pdf into prefix.pdf.
Here we generate the list of prefixes by searching for files ending with _NUMBER.pdf and removing that last part. This assumes that filenames do not contain linebreaks.
printf '%s\n' *_*.pdf |
sed -En 's/_[0-9]+\.pdf$//p' |   # strip the trailing _NUMBER.pdf, keep the prefix
sort -u |
while IFS= read -r prefix; do
    pdfunite "$prefix"_*.pdf "$prefix.pdf"
done
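One caveat: the "$prefix"_*.pdf glob expands in lexical order, so poetry_10.pdf would come before poetry_2.pdf. If your numbering goes past 9, a version-sorted variant may be safer (this uses GNU sort -V and additionally assumes no whitespace in the filenames):
printf '%s\n' *_*.pdf | sed -En 's/_[0-9]+\.pdf$//p' | sort -u |
while IFS= read -r prefix; do
    # expand the parts in numeric (version) order before uniting them
    pdfunite $(printf '%s\n' "$prefix"_*.pdf | sort -V) "$prefix.pdf"
done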
Currently, I have genome sequencing data as fq.gz files and each sample consists of a folder titled with the sample name containing 2-8 pairs of forward and reverse reads. I would like to concatenate all of the forward and reverse files into one forward and one reverse file for each sample while maintaining pair order.
My data are organized as follows:
/ParentFolder/SampleA/V549_1.fq.gz
/ParentFolder/SampleA/V549_2.fq.gz
/ParentFolder/SampleA/V550_1.fq.gz
/ParentFolder/SampleA/V550_2.fq.gz
/ParentFolder/SampleB/V588_1.fq.gz
/ParentFolder/SampleB/V588_2.fq.gz
/ParentFolder/SampleB/V599_1.fq.gz
/ParentFolder/SampleB/V599_2.fq.gz
In order to concatenate the files, I tried the following:
ls *_1.fq.gz | sort | xargs cat > SampleA_1.fq.gz
ls *_2.fq.gz | sort | xargs cat > SampleA_2.fq.gz
This works for one sample folder; however, since I have many dozens of samples, I would like a bash script that does this across all my samples and names each concatenated file after its parent folder. I am still learning and a bit lost, so I would greatly appreciate any help with this problem.
I have attempted the following, without success:
for i in $(find ./ -type f -name "*.fq.gz"; done | sort | uniq)
do echo "Merging 1"
cat "$i"*_1.fq.gz > "$i"CG1-1_1.fq.gz
Thank you for any input/advice/solutions.
Consider the following logic:
for each suffix (_1, _2):
    find all the fq.gz files with that suffix
    extract the list of folders that contain them
    for each folder:
        merge its fq.gz files into a new FOLDER_suffix.fq.gz in the parent
p=ParentFolder   # path to the directory holding the sample folders
for suffix in 1 2 ; do
    # Find all dirs containing files for this suffix.
    dirs=$(printf '%s\n' $p/*/*_$suffix.fq.gz | sed 's:/[^/]*$::' | uniq)
    for d in $dirs ; do
        # Merge, and store the result in the parent, named after the folder.
        (cd $d ; cat *_${suffix}.fq.gz > ../${d##*/}_${suffix}.fq.gz)
    done
done
Notes:
The code assumes no special characters (e.g. whitespace) in folder names.
More compact files will be created if the merge process uncompresses the original data and re-compresses it (gzcat *.gz | gzip) instead of concatenating the compressed streams, at the cost of extra CPU time; a sketch follows.
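A minimal sketch of that re-compressing variant, under the same layout assumptions (zcat is the usual GNU spelling of gzcat, and every sample folder is assumed to contain files for both suffixes):
p=ParentFolder
for suffix in 1 2 ; do
    for d in "$p"/*/ ; do
        d=${d%/}   # strip the trailing slash left by the glob
        # decompress, concatenate, and re-compress into the parent folder
        (cd "$d" && zcat ./*_${suffix}.fq.gz | gzip > "../${d##*/}_${suffix}.fq.gz")
    done
done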
I'd like to rename FASTA files with the organism name (stored in the file) and the identifier (part of the filename).
All files follow the same format in both filename and stored data; each file has only one FASTA header and its corresponding sequence.
Original filename:
$ head GCF_000008205.1_ASM820v1_genomic.fna
>NC_007295.1 Mycoplasma hyopneumoniae J, complete genome
CCAAAATCAACTTTATTAAATGTGCTAAATAAAGTTGATAAAATGTTTGCAAAAACATTTTTGTTGTTTTAAACAAAACA
AATTGATTTAAAAATTATACTACAAAATTAAAGGAAAATTTATAAAATGCAAACAAATAAAAATAATTTAAAGGTTAGAA
CACAGCAAATTAGACAACAAATTGAAAATTTATTAAATGATCGAATGTTGTATAACAACTTTTTTAGCACAATTTATGTA
...
I'd like to rename the file using the assembly identifier (GCF_000008205.1) from the filename together with the second and third words of the FASTA header (Mycoplasma hyopneumoniae):
Mycoplasma_hyopneumoniae_GCF_000008205.1.fna
I've tried this:
for fname in *.fna; do
mv -- "$fname" \
"$(awk 'NR==1{printf("%s_%s_%s\n",$2,$3,substr($1,2));exit}' "$fname")".fna
done
result:
Mycoplasma_hyopneumoniae_NC_007295.1.fna
But the result contains the sequence accession from the header (NC_007295.1) instead of the assembly identifier from the original filename, which is the part that interests me.
Thanks!
The following idea works, but only if every single file is formatted like the one in your example.
In the directory that has all your files do the following:
for i in *.fna; do
    # organism: 2nd and 3rd words of the (single) FASTA header
    name1=$(awk 'NR==1 {printf "%s_%s_", $2, $3; exit}' "$i")
    # assembly id: first two '_'-separated fields of the filename
    name2=$(cut -d_ -f1,2 <<< "$i").fna
    mv -- "$i" "${name1}${name2}"
done
I suggest creating a backup copy of the folder before trying it, just in case some of your files are formatted differently.
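Alternatively, a dry run that only prints the mv commands lets you inspect the new names before committing:
for i in *.fna; do
    name1=$(awk 'NR==1 {printf "%s_%s_", $2, $3; exit}' "$i")
    name2=$(cut -d_ -f1,2 <<< "$i").fna
    echo mv -- "$i" "${name1}${name2}"   # drop "echo" once the output looks right
done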
I have around 1000 files from a phylogenetic analysis, and each file looks something like this:
File 1
(((A:0.03550734102561460548,B:0.04004337325891465377):0.01263892787244691278,C:0.03773369182398536581):0.08345900687345568503,(D:0.04441859961888914438,((E:0.04707945363105774106,F:0.03769496882665739068):0.00478087012691866091,G:0.01269975716918288373):0.01263526019405349088):0.13087200352448438712,H:1.91169780510990117151):0.0;
File 12
((A:0.11176787864288327545,(B:0.18283029119402782747,C:0.12136417416322728413):0.02367730683755531543):0.21101090994668161849,(((F:0.06464548582830945134,E:0.06903977777526745796):0.01710921464740109560,G:0.01577242071367901746):0.00958883666063858192,D:0.03506359820882300193):0.47082738536589324729,H:2.94973933657097164840):0.0;
I want to read the content of each file and classify the files according to their patterns (i.e. the file content). The numbers here represent branch lengths and will not be the same in any two files, so I would like to classify the files based on the labels A to H alone: all files in which the labels A to H are arranged in the same order should be sorted into the same folder. For example, for File 1 the pattern, ignoring the numbers (branch lengths), will be something like this:
(((A:,B:),C:):,(D:,((E:,F:):,G:):):,H:):;
And all the files that contain this pattern would go into the same folder:
File 1
File 5
File 6
File 10
....
I know how to sort files matching a known pattern using:
grep -l -Z pattern files | xargs -0 mv -t target-directory --
But I am not sure how to do it in this case, as I do not have prior knowledge of the patterns.
You can extract the content patterns and sort by them:
$ for f in file{1..2}; do printf '%s\t' "$f"; tr -d ' 0-9.' < "$f"; done | sort -k2
file1   (((A:,B:):,C:):,(D:,((E:,F:):,G:):):,H:):;
file2   ((A:,(B:,C:):):,(((F:,E:):,G:):,D:):,H:):;
Files with the same pattern will end up on consecutive lines. This assumes you have one tree record per file.
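To go one step further and actually move the files into per-pattern folders, here is a minimal sketch built on the same tr trick; the pattern_N folder naming and the file* glob are assumptions, and bash 4+ is needed for the associative array:
#!/bin/bash
declare -A dir_for   # maps a topology pattern to its folder name
n=0
for f in file*; do
    pattern=$(tr -d ' 0-9.' < "$f")            # strip branch lengths
    if [[ -z ${dir_for[$pattern]} ]]; then     # first time this pattern is seen
        dir_for[$pattern]=pattern_$((++n))
        mkdir -p "${dir_for[$pattern]}"
    fi
    mv -- "$f" "${dir_for[$pattern]}/"
done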