Concatenating multiple fastq files and renaming to parent folder - bash

Currently, I have genome sequencing data as fq.gz files. Each sample has its own folder, named after the sample, containing 2-8 pairs of forward and reverse read files. I would like to concatenate all of the forward files and all of the reverse files into one forward and one reverse file per sample while maintaining pair order.
My data are organized as follows:
/ParentFolder/SampleA/V549_1.fq.gz
/ParentFolder/SampleA/V549_2.fq.gz
/ParentFolder/SampleA/V550_1.fq.gz
/ParentFolder/SampleA/V550_2.fq.gz
/ParentFolder/SampleB/V588_1.fq.gz
/ParentFolder/SampleB/V588_2.fq.gz
/ParentFolder/SampleB/V599_1.fq.gz
/ParentFolder/SampleB/V599_2.fq.gz
In order to concatenate the files, I tried the following:
ls *_1.fq.gz | sort | xargs cat > SampleA_1.fq.gz
ls *_2.fq.gz | sort | xargs cat > SampleA_2.fq.gz
This works for a single sample folder; however, since I have many dozens of samples, I would like to write a bash script that does this across all of my samples and names each concatenated file after its parent folder. I am still learning and a bit lost, so I would greatly appreciate any help with this problem.
I have attempted the following, without success:
for i in $(find ./ -type f -name "*.fq.gz" | sort | uniq)
do echo "Merging 1"
cat "$i"*_1.fq.gz > "$i"CG1-1_1.fq.gz
done
Thank you for any input/advice/solutions.

Consider the following logic:
for each suffix (_1, _2):
    find all the fq.gz files with that suffix
    extract the list of folders they live in
    for each folder:
        merge the contained fq.gz files into a new FOLDER_<suffix>.fq.gz next to the folder
p=pp    # set this to the parent folder that holds the sample directories
for suffix in 1 2 ; do
    # Find all dirs containing suffix files.
    dirs=$(printf '%s\n' $p/*/*_$suffix.fq.gz | sed 's:/[^/]*$::' | uniq)
    for d in $dirs ; do
        # Merge, and store in the parent.
        (cd $d ; cat *_${suffix}.fq.gz > ../${d##*/}_${suffix}.fq.gz)
    done
done
Notes:
The code assumes there are no special characters in the folder names.
More compact files will be created if the merge step decompresses the original data and re-compresses it (e.g. gzcat *_1.fq.gz | gzip > merged_1.fq.gz) rather than concatenating the compressed files directly.
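If the folder names may contain spaces, a loop over the sample directories avoids relying on word splitting. This is a minimal sketch of the same idea, assuming the /ParentFolder layout shown in the question:
#!/bin/bash
# Hypothetical location; adjust to the real parent folder.
parent=/ParentFolder

for d in "$parent"/*/ ; do
    sample=$(basename "$d")
    for suffix in 1 2 ; do
        # Globs expand in sorted order, so pair order is preserved.
        cat "$d"*_"$suffix".fq.gz > "$parent/${sample}_${suffix}.fq.gz"
    done
done
Running it against the example layout would produce /ParentFolder/SampleA_1.fq.gz, /ParentFolder/SampleA_2.fq.gz, and so on.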

Related

How to list the latest-timestamp files from partially duplicate names

I have more than 10k files like the example given below. I would like to filter out partial duplicates, i.e. files sharing the prefix 123456 are considered duplicates, and from each group of duplicates I need the file with the latest timestamp.
123456_20200425-012034.xml
123456_20200424-120102.xml
123456_20200425-121102.xml
234567_20200323-112232.xml
123456_20200423-111102.xml --- assume this file has the latest timestamp of all the duplicate files above
How can I do this using bash?
The output should also include the files that have no duplicates; out of the 10k files, some are not duplicated, and those should be included in the output as well.
The required output (the latest-timestamp files) looks like:
123456_20200423-111102.xml
234567_20200323-112232.xml
I have tried this:
list=$(ls | awk -F _ '{print $1}' | uniq)
for i in $list
do
    mv "$(find . -type f -name "$i*" -print | sort -n -t _ -k 2 | tail -1)" ../destination
done
1) Store the unique prefixes in list
2) Loop over the list, find the latest-timestamp file for each prefix, and move it to the destination folder
Because we can assume that globs are sorted alphanumerically, we can use a wildcard to iterate over the files and build a set of results:
#!/bin/bash
# change INPUTDIR to your input directory
INPUTDIR=.
seen=
store=()
for file in "$INPUTDIR"/* ; do
    if [[ "$seen" != *"${file%_*}"* ]] ; then
        store+=( "$file" )
        seen="$seen ${file%_*}"
    fi
done
# results
echo "${store[@]}"
Explanation:
Iterate over all files in a directory.
Get the part of the filename before the underscore (i.e. 123456). If we haven't seen it before (i.e. "$seen" != *"${file%_*}"*), add the file to our list of files to store. If we have seen it before, skip the file.
Print the results.
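Note that with an ascending glob this keeps the first (i.e. oldest) file per prefix. If the newest file per prefix is wanted instead, one alternative sketch, assuming the YYYYMMDD-HHMMSS timestamps sort correctly as plain text and the file names contain no newlines:
printf '%s\n' *_*.xml | sort -r | awk -F_ '!seen[$1]++'
Reverse-sorting puts the latest timestamp first within each prefix, and the awk filter prints only the first occurrence of each prefix.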

Is it possible to grep using an array as pattern?

TL;DR
How do I filter ls/find output using grep, with an array as the pattern?
Background story:
I have a pipeline that I have to rerun for datasets that ran into an error.
Which datasets ran into an error is recorded in a tab-separated file.
I want to delete the files of the datasets where the pipeline ran into an error.
To do so, I extracted the dataset names from another file containing the finished datasets and saved them in a bash array (ds1 ds2 ...), but now I am stuck because I cannot figure out how to exclude the datasets in the array from my deletion step.
This is the folder structure (X=1-30):
datasets/dsX/results/dsX.tsv
Without excluding the finished datasets, i.e. deleting the folder contents of both the failed and the finished datasets, this works like a charm:
#1. move content to a trash folder
ls /datasets/*/results/*|xargs -I '{}' mv '{}' ./trash/
#2. delete the empty folders
find /datasets/*/. -type d -empty -delete
But since I want to exclude the finished datasets I thought it would be clever to save them in an array:
#find finished datasets by extracting the dataset names from a tab separated log file
mapfile -t -s 1 finished < <(awk '{print $2}' $path/$log_pf)
echo "${finished[@]}"
which works as expected, but now I am stuck on filtering the ls output using that array:
# pseudocode
# trying to ignore the datasets in the array - not working
ls -I${finished[@]} -d /datasets/*/
# trying to reverse grep for the finished datasets - not working
ls /datasets/*/ | grep -v {finished}
What do you think about my current ideas?
Is this possible using bash only? I guess I could do it easily in Python, but for training purposes I want to do it in bash.
grep can get the patterns from a file using the -f option. Note that file names containing newlines will cause problems.
If you need to process the input somehow, you can use process substitution:
grep -f <(process the input...)
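For example, to feed the array from the question to grep as fixed-string patterns, something along these lines should work (a sketch reusing the finished array and paths from the question):
ls /datasets/*/results/* | grep -v -F -f <(printf '%s\n' "${finished[@]}")
printf prints one dataset name per line, and grep -v -F -f excludes every path that contains one of those names.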
I must admit I'm confused about what you're doing, but if you're just trying to produce a list of files excluding those stored in column 2 of some other file, and your file/directory names can't contain spaces, then that would be:
find /datasets -type f | awk 'NR==FNR{a[$2]; next} !($0 in a)' "$path/$log_pf" -
If that's not all you need then please edit your question to clarify your requirements and add concise testable sample input and expected output.
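If the goal is then to move the surviving paths to the trash folder as in the question, the filtered list could be fed to xargs; a sketch, assuming no spaces or newlines in the paths:
find /datasets -type f | awk 'NR==FNR{a[$2]; next} !($0 in a)' "$path/$log_pf" - | xargs -I '{}' mv '{}' ./trash/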

Sort files based on content

I have around 1000 files from a phylogenetic analysis and each file looks something like this
File 1
(((A:0.03550734102561460548,B:0.04004337325891465377):0.01263892787244691278,C:0.03773369182398536581):0.08345900687345568503,(D:0.04441859961888914438,((E:0.04707945363105774106,F:0.03769496882665739068):0.00478087012691866091,G:0.01269975716918288373):0.01263526019405349088):0.13087200352448438712,H:1.91169780510990117151):0.0;
File 12
((A:0.11176787864288327545,(B:0.18283029119402782747,C:0.12136417416322728413):0.02367730683755531543):0.21101090994668161849,(((F:0.06464548582830945134,E:0.06903977777526745796):0.01710921464740109560,G:0.01577242071367901746):0.00958883666063858192,D:0.03506359820882300193):0.47082738536589324729,H:2.94973933657097164840):0.0;
I want to read the content of each file and classify the files according to their pattern (meaning the file content). The numbers here represent branch lengths and will not be the same in any two files, so I would like to classify the files based only on the letters A to H. For instance, all the files in which the letters A to H are arranged in the same order should be sorted into the same folder. For example:
For File 1, the pattern (ignoring the numbers/branch lengths) will be something like this:
(((A:,B:),C:):,(D:,((E:,F:):,G:):):,H:):;
And all the files that contain this pattern will go into a folder.
File 1
File 5
File 6
File 10
....
I know how to sort files matching a particular pattern into a target directory using:
grep -l -Z pattern files | xargs -0 mv -t target-directory --
But I am not sure how to do it in this case, as I do not have prior knowledge of the patterns.
You can get the content patterns and sort on them:
$ for f in file{1..2}; do
    printf "%s\t" "$f"
    tr -d '[ 0-9.]' < "$f"
  done | sort -k2
file1 (((A:,B:):,C:):,(D:,((E:,F:):,G:):):,H:):;
file2 ((A:,(B:,C:):):,(((F:,E:):,G:):,D:):,H:):;
Files with the same pattern will end up on consecutive lines. This assumes you have one tree record per file.
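To go a step further and actually move the files into one folder per pattern, a minimal sketch along the same lines (assuming one tree per file and file names without special characters; the folder names are hypothetical checksums of each pattern):
#!/bin/bash
for f in file*; do
    # Strip spaces, digits, and dots to get the bare topology pattern.
    pattern=$(tr -d '[ 0-9.]' < "$f")
    # Use a checksum of the pattern as a safe directory name.
    dir=pattern_$(printf '%s' "$pattern" | md5sum | cut -d' ' -f1)
    mkdir -p "$dir"
    mv "$f" "$dir"/
done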

Bash script to recursively traverse directories, compare and sync files

I'm trying to write a bash shell script to sync content on two different paths.
The algorithm I'm striving for consists of the following steps:
1) given two full (as opposed to relative) paths,
2) recursively compare files (whose filenames may optionally have a basename and suffix) in corresponding directories of both paths,
3) if a corresponding directory or file is not present, copy each file (from the path that has it) to the other corresponding folder.
I've figured out steps 1 and 2, which are:
OLD_IFS=$IFS
# The extra space after is crucial
IFS=\
for old_file in `diff -rq old/ new/ | grep "^Files.*differ$" | sed 's/^Files \(.*\) and .* differ$/\1/'`
do
    mv "$old_file" "$old_file.old"
done
IFS=$OLD_IFS
Thanks.
I have implemented a similar algorithm in Java, which essentially boils down to this:
Retrieve a listing of directories A and B, e.g. A.lst and B.lst
Create the intersection of both listings (e.g. cat A.lst B.lst | sort | uniq -d). This is the list of files you need to actually compare; you will also have to descend to any directories recursively.
You may want to have a look at the conditional expressions supported by your shell (e.g. for bash) or by the test command. I would also suggest using cmp instead of diff.
Note: you need to consider what the proper action should be when you have a directory on one side and a file on the other with the same name.
Find the files that are only present in A (e.g. cat A.lst B.lst B.lst | sort | uniq -u) and copy them recursively (cp -a) to B.
Similarly, find the files that are only present in B and copy them recursively to A.
EDIT:
I forgot to mention a significant optimization: if you sort the file lists A.lst and B.lst beforehand, you can use comm instead of cat ... | sort | uniq ... to perform the set operations:
Intersection: comm -12 A.sorted.lst B.sorted.lst
Files that exist only in A: comm -23 A.sorted.lst B.sorted.lst
Files that exist only in B: comm -13 A.sorted.lst B.sorted.lst
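A minimal bash sketch of that comm-based approach, assuming GNU findutils/coreutils, two directories A and B, and file names without newlines:
#!/bin/bash
# Build sorted listings of relative paths under A and B.
(cd A && find . -type f | sort) > A.sorted.lst
(cd B && find . -type f | sort) > B.sorted.lst

# Copy files that exist only on one side to the other (--parents keeps the directory structure).
comm -23 A.sorted.lst B.sorted.lst | (cd A && xargs -r -d '\n' cp -a --parents -t ../B)
comm -13 A.sorted.lst B.sorted.lst | (cd B && xargs -r -d '\n' cp -a --parents -t ../A)

# Compare the files present on both sides and report the ones that differ.
comm -12 A.sorted.lst B.sorted.lst | while IFS= read -r f; do
    cmp -s "A/$f" "B/$f" || echo "differs: $f"
done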
There exists a ready-made solution (shell script), based on find (also using the same idea as yours), to synchronize two directories: https://github.com/Fitus/Zaloha.sh.
Documentation is here: https://github.com/Fitus/Zaloha.sh/blob/master/DOCUMENTATION.md.
Cheers

Need a way to gather total size of JAR files

I am new to more advanced bash commands. I need a way to count the size of external libraries in our codeline. There are a few main directories but I also have a spreadsheet with the actual locations of the libraries that need to be included.
I have fiddled with find and du but it is unclear to me how to specify multiple locations. Can I find the size of several hundred jars listed in the spreadsheet instead of approximating with the main directories?
Edit: I can now find the size of specific files. I had to export the Excel spreadsheet with the locations to a CSV. In PSPad I "joined lines" and copied and pasted that directly into the list_of_files slot (find list_of_files | xargs du -hc). I could not get find to use a file containing the locations separated by spaces/tabs/newlines.
Now I can't tell whether replacing list_of_files with list_of_directories will work. It looks like it counts things twice, e.g.:
1.0M /folder/dummy/image1.jpg
1.0M /folder/dummy/image2.jpg
2.0M /folder/dummy
3.0M /folder/image3.jpg
7.0M /folder
14.0M total
This output is made up, but if it is counting like this then that is not what I want. The reason I suspect this is that the total I'm getting seems really high.
Do you mean...
find list_of_directories | xargs du -hc
Then, if you want to pipe to du exactly the files that are listed in the spreadsheet, you need a way to extract them. Is it a text file, or what format is it in?
find $(cat file) | xargs du -hc
might do it if they are listed in a text file separated by spaces. You will probably have some issues with spaces in the paths, though; you would have to quote the filenames.
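If the exported list has one path per line, a sketch that avoids the word-splitting problem (assuming GNU xargs/du and a hypothetical jar_paths.txt exported from the spreadsheet):
# Strip CR characters from a Windows/Excel export, pass one path per line to du, and keep the grand total.
tr -d '\r' < jar_paths.txt | xargs -d '\n' du -ch | tail -n 1
du -ch prints each file's size plus a final "total" line; tail -n 1 keeps just that total.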
for fn in `find DIR1 DIR2 FILE1 -name '*.jar'`; do du "$fn"; done | awk '{TOTAL += $1} END {print TOTAL}'
You can specify your files and directories in place of DIR1, DIR2, FILE1, etc. You can list their individual sizes by removing the piped awk command.
