Merge same-name multi-part PDF files in order - bash

How do I merge PDF files that share the same name? I have:
poetry_2.pdf
poetry_3.pdf
poetry_4.pdf
metaphysics_2.pdf
metaphysics_3.pdf
and I am looking for:
poetry.pdf
metaphysics.pdf
This loop to check the PDF files and merge them with pdfunite failed:
for file1 in *_02.pdf ; do
# get your second_ files
file2=${file1/_02.pdf/_03.pdf}
# merge them together
pdfunite $file1 $file2 $file1.pdf
done

First, you need a list of prefixes (e.g. poetry, metaphysics). Then, iterate over that list and unite prefix_*.pdf into prefix.pdf.
Here we generate the list of prefixes by searching for files ending with _NUMBER.pdf and removing that last part. This assumes that filenames do not contain linebreaks.
printf %s\\n *_*.pdf | sed -En 's/_[0-9]+\.pdf$//p' | sort -u |
while IFS= read -r prefix; do
pdfunite "$prefix"_*.pdf "$prefix.pdf"
done
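
One caveat: the glob "$prefix"_*.pdf expands in lexical order, so a part such as poetry_10.pdf would sort before poetry_2.pdf. A minimal sketch of a variant that feeds the parts to pdfunite in numeric order, assuming GNU sort with -V (version sort) and filenames without whitespace:
printf %s\\n *_*.pdf | sed -En 's/_[0-9]+\.pdf$//p' | sort -u |
while IFS= read -r prefix; do
    # sort the parts by their numeric suffix before handing them to pdfunite
    pdfunite $(printf '%s\n' "$prefix"_*.pdf | sort -V) "$prefix.pdf"
done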

Related

How to produce multiple readlength.tsv files at once from multiple fastq files?

I have 16 fastq files under different directories and need to produce a readlength.tsv for each of them separately. This is the script I use to produce a single readlength.tsv:
zcat ~/proje/project/name/fıle_fastq | paste - - - - | cut -f1,2 | while read readID sequ;
do
len=`echo $sequ | wc -m`
echo -e "$readID\t$len"
done > ~/project/name/fıle1_readlength.tsv
I can produce each readlength.tsv one by one, but that will take a long time. I want to produce them all at once, so I created a list containing these fastq files, but I could not come up with a loop that produces the readlength.tsv output from the 16 fastq files in one go.
I would appreciate it if you could help me.
Assuming a file list.txt contains the 16 file paths such as:
~/proje/project/name/file1_fastq
~/proje/project/name/file2_fastq
..
~/path/to/the/fastq_file16
Then would you please try:
#!/bin/bash
while IFS= read -r f; do # "f" is assigned to each fastq filename in "list.txt"
mapfile -t ary < <(zcat "$f") # assign "ary" to the array of lines
echo -e "${ary[0]}\t${#ary[1]}" # ${ary[0]} is the id and ${#ary[1]} is the length of sequence
done < list.txt > readlength.tsv
As the fastq file format contains the id in the 1st line and the sequence
in the 2nd line, the bash built-in mapfile is well suited to handling them.
As a side note, the letter ı in your code looks like a non-ascii character.
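
If each fastq file holds many reads and you want a separate readlength.tsv per input file, as in the original script, here is a minimal sketch along the same lines; it assumes list.txt holds one path per line with no "~" or spaces in the paths, and that each input filename ends in _fastq so the output name can be derived from it:
#!/bin/bash
while IFS= read -r f; do
    # one id/length pair per read; ${#seq} counts the sequence characters
    # without the trailing newline that "wc -m" would also count
    zcat "$f" | paste - - - - | cut -f1,2 |
    while IFS=$'\t' read -r readID seq; do
        printf '%s\t%s\n' "$readID" "${#seq}"
    done > "${f%_fastq}_readlength.tsv"
done < list.txt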

How to list the latest-timestamp files among partially duplicate names

I have more than 10k files like in the example below. I would like to filter out partial duplicates, i.e. 123456 is common to all of the listed files, so they are considered duplicates, and out of these duplicate files I need the one with the latest timestamp.
123456_20200425-012034.xml
123456_20200424-120102.xml
123456_20200425-121102.xml
234567_20200323-112232.xml
123456_20200423-111102.xml --- consider that this file has the latest timestamp out of all the duplicate files above
How can I do this using bash?
The output should also include files that have no duplicates; out of the 10k files, some are not duplicated, and those should appear in the output as well.
The required output looks like this (latest-timestamp files):
123456_20200423-111102.xml
234567_20200323-112232.xml
I have done it like this:
list=$(ls | awk -F _ '{print $1}' | uniq)
for i in $list
do
mv "$(find . -type f -name "$i*" -print | sort -n -t _ -k 2 | tail -1)" ../destination
done
1) Stored the unique prefixes in list.
2) Looped over the list, found the latest-timestamp file for each prefix, and moved it to the destination folder.
Because we can assume that globs are sorted alphanumerically, we can use a wildcard to iterate over the files and build a set of results:
#!/bin/bash
# change INPUTDIR to your input directory
INPUTDIR=.
seen=
store=()
for file in "$INPUTDIR"/* ; do
if [[ "$seen" != *"${file%_*}"* ]] ; then
store+=( "$file" )
seen="$seen ${file%_*}"
fi
done
# results
echo "${store[@]}"
Explanation:
Iterate over all files in a directory.
Get the part of the filename before the underscore (i.e. 123456). If we haven't seen it before (i.e. "$seen" != *"${file%_*}"*), add the file to our list of files to store. If we have seen it before, skip the file.
Print the results.
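
If "latest" should mean the newest timestamp for each prefix rather than the first file the glob yields, a minimal sketch of that variant, assuming bash 4+ (for associative arrays) and that the YYYYMMDD-HHMMSS timestamps sort correctly as plain strings:
#!/bin/bash
INPUTDIR=.
declare -A latest
for file in "$INPUTDIR"/*_*.xml ; do
    # the glob expands in ascending order, so overwriting on every match
    # leaves the newest file for each prefix
    latest[${file%_*}]=$file
done
printf '%s\n' "${latest[@]}"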

Concatenating multiple fastq files and renaming to parent folder

Currently, I have genome sequencing data as fq.gz files; each sample has a folder named after the sample that contains 2-8 pairs of forward and reverse read files. I would like to concatenate all of the forward files and all of the reverse files into one forward and one reverse file per sample while maintaining pair order.
My data are organized as follows:
/ParentFolder/SampleA/V549_1.fq.gz
/ParentFolder/SampleA/V549_2.fq.gz
/ParentFolder/SampleA/V550_1.fq.gz
/ParentFolder/SampleA/V550_2.fq.gz
/ParentFolder/SampleB/V588_1.fq.gz
/ParentFolder/SampleB/V588_2.fq.gz
/ParentFolder/SampleB/V599_1.fq.gz
/ParentFolder/SampleB/V599_2.fq.gz
In order to concatenate the files, I tried the following:
ls *_1.fq.gz | sort | xargs cat > SampleA_1.fq.gz
ls *_2.fq.gz | sort | xargs cat > SampleA_2.fq.gz
This works for one sample folder. However, since I have many dozens of samples, I would like to write a bash script that accomplishes this across all of my samples and names the concatenated files after their parent folder. I am still learning and a bit lost; I would greatly appreciate any help with this problem.
I have attempted the following, without success:
for i in $(find ./ -type f -name "*.fq.gz"; done | sort | uniq)
do echo "Merging 1"
cat "$i"*_1.fq.gz > "$i"CG1-1_1.fq.gz
Thank you for any input/advice/solutions.
Consider the following logic:
for each suffix (_1, _2):
Find all the fq.gz files
Extract list of folders
For each folder
Merge all the contained 'fq' files into a new 'FOLDER_suffix.fq.gz' in the parent folder
p=pp    # parent folder; point this at your ParentFolder path
for suffix in 1 2 ; do
# Find all dirs containing suffix files.
dirs=$(printf '%s\n' $p/*/*_$suffix.fq.gz | sed 's:/[^/]*$::' | uniq)
for d in $dirs ; do
# Merge, and store in parent.
(cd $d ; cat *_${suffix}.fq.gz > ../${d##*/}_${suffix}.fq.gz)
done
done
Notes:
The code assumes there are no special characters in the folder names.
More compact files will be created if the merge step uncompresses the original data and re-compresses it (gzcat *.gz ...); a sketch follows.
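
For that recompression note, a minimal sketch of the inner merge line rewritten that way, as a drop-in replacement for the cat line in the loop above (gzcat may be named zcat or gzip -dc on some systems):
# decompress all parts, then compress the concatenation as one stream
(cd $d ; gzcat *_${suffix}.fq.gz | gzip > ../${d##*/}_${suffix}.fq.gz)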

Concatenate files with the same partial ID - bash

I have a directory with many fq.gz files. I want to loop over the filenames and concatenate any files that share the same partial ID. For example, out of the 1000 files in the directory, these six need to be concatenated into a single file (as they share the same ID from "L1" onwards):
141016-FC012-L1-N707-S504--123V_pre--Hs--R1.fq.gz
141031-FC01229-L1-N707-S504--123V_pre--Hs--R1.fq.gz
141020-FC01209-L1-N707-S504--123V_pre--Hs--R1.fq.gz
141027-FC013-L1-N707-S504--123V_pre--Hs--R1.fq.gz
141023-FC01219-L1-N707-S504--123V_pre--Hs--R1.fq.gz
Can anyone help??
Probably not the best way, but this might do what you need:
while IFS= read -r -d '' id; do
cat *"$id" > "/some/location/${id%.fq.gz}_grouped.fq.gz"
done < <(printf '%s\0' *.fq.gz | cut -zd- -f3- | sort -uz)
This will create files with the following format:
<ID>_grouped.fq.gz
L1-N707-S504--123V_pre--Hs--R1_grouped.fq.gz
...
...
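
To preview which group IDs the pipeline will derive before writing anything, the ID-extraction part can be run on its own (it relies on the same GNU cut/sort -z support as the answer):
# list the distinct IDs (everything from the third dash-separated field on)
printf '%s\0' *.fq.gz | cut -zd- -f3- | sort -uz | tr '\0' '\n'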

Sort files based on content

I have around 1000 files from a phylogenetic analysis, and each file looks something like this:
File 1
(((A:0.03550734102561460548,B:0.04004337325891465377):0.01263892787244691278,C:0.03773369182398536581):0.08345900687345568503,(D:0.04441859961888914438,((E:0.04707945363105774106,F:0.03769496882665739068):0.00478087012691866091,G:0.01269975716918288373):0.01263526019405349088):0.13087200352448438712,H:1.91169780510990117151):0.0;
File 12
((A:0.11176787864288327545,(B:0.18283029119402782747,C:0.12136417416322728413):0.02367730683755531543):0.21101090994668161849,(((F:0.06464548582830945134,E:0.06903977777526745796):0.01710921464740109560,G:0.01577242071367901746):0.00958883666063858192,D:0.03506359820882300193):0.47082738536589324729,H:2.94973933657097164840):0.0;
I want to read the content of each file and classify the files according to their patterns (meaning the file content). The numbers here represent branch lengths and will not be the same in any of the files, so I would like to classify the files based on the letters A to H: all the files in which the letters A to H are arranged in the same order should be sorted into the same folder. For example:
For File 1, ignoring the numbers (branch lengths), the pattern will be something like this:
(((A:,B:),C:):,(D:,((E:,F:):,G:):):,H:):;
And all the files that contain this pattern will go into a folder.
File 1
File 5
File 6
File 10
....
I know how to sort files based on a particular pattern using:
grep -l -Z pattern files | xargs -0 mv -t target-directory --
But I am not sure how to do it in this case, as I do not have prior knowledge of the patterns.
You can get the content patterns and sort them:
$ for f in file{1..2};
do printf "%s\t" $f; tr -d '[ 0-9.]' <$f;
done |
sort -k2
file1 (((A:,B:):,C:):,(D:,((E:,F:):,G:):):,H:):;
file2 ((A:,(B:,C:):):,(((F:,E:):,G:):,D:):,H:):;
Files with the same pattern will appear consecutively. This assumes you have one record per file.
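
To take it one step further and actually move the files into per-pattern folders, a minimal sketch that reuses the same tr command; naming each folder after the md5 hash of its pattern is just one hypothetical choice, not something from the answer above:
#!/bin/bash
for f in file*; do
    # strip branch lengths and whitespace, keeping only the topology pattern
    pattern=$(tr -d '[ 0-9.]' < "$f")
    # hash the pattern to get a short, filesystem-safe folder name
    dir=$(printf '%s' "$pattern" | md5sum | cut -d' ' -f1)
    mkdir -p "$dir"
    mv -- "$f" "$dir"/
done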
