Bash script to recursively traverse directories, compare and sync files - shell

I'm trying to write a bash shell script to sync content on two different paths. The algorithm I'm striving for consists of the following steps:
1. Take two full (as opposed to relative) paths.
2. Recursively compare files (whose filenames may optionally have a basename and suffix) in corresponding directories of both paths.
3. If a corresponding directory or file is not present, copy each file (from the path that has it) to the other corresponding folder.
I've figured out steps 1 and 2, which are:
OLD_IFS=$IFS
# Split only on newlines so filenames containing spaces survive the loop
IFS=$'\n'
for old_file in $(diff -rq old/ new/ | grep "^Files.*differ$" | sed 's/^Files \(.*\) and .* differ$/\1/')
do
    mv "$old_file" "$old_file.old"
done
IFS=$OLD_IFS
Thanks.

I have implemented a similar algorithm in Java, which essentially boils down to this:
1. Retrieve a listing of directories A and B, e.g. A.lst and B.lst.
2. Create the intersection of both listings (e.g. cat A.lst B.lst | sort | uniq -d). This is the list of files you need to actually compare; you will also have to descend into any directories recursively.
   You may want to have a look at the conditional expressions supported by your shell (e.g. bash) or by the test command. I would also suggest using cmp instead of diff.
   Note: you need to consider what the proper action should be when you have a directory on one side and a file on the other with the same name.
3. Find the files that are only present in A (e.g. cat A.lst B.lst B.lst | sort | uniq -u) and copy them recursively (cp -a) to B.
4. Similarly, find the files that are only present in B and copy them recursively to A.
EDIT:
I forgot to mention a significant optimization: if you sort the file lists A.lst and B.lst beforehand, you can use comm instead of cat ... | sort | uniq ... to perform the set operations:
Intersection: comm -12 A.sorted.lst B.sorted.lst
Files that exist only in A: comm -23 A.sorted.lst B.sorted.lst
Files that exist only in B: comm -13 A.sorted.lst B.sorted.lst
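Putting the pieces together, here is a minimal sketch of the whole approach using comm (the paths, the .lst file names and the mkdir/cp handling are illustrative assumptions; it also assumes file names without whitespace):
A=/path/to/A
B=/path/to/B
# Sorted, recursive listings of both trees (regular files only).
(cd "$A" && find . -type f | sort) > A.sorted.lst
(cd "$B" && find . -type f | sort) > B.sorted.lst
# Files present in both trees: compare their contents with cmp.
comm -12 A.sorted.lst B.sorted.lst | while read -r f; do
    cmp -s "$A/$f" "$B/$f" || echo "differs: $f"
done
# Files only in A: copy to B, creating missing sub-directories; then the reverse.
comm -23 A.sorted.lst B.sorted.lst | while read -r f; do
    mkdir -p "$B/$(dirname "$f")" && cp -a "$A/$f" "$B/$f"
done
comm -13 A.sorted.lst B.sorted.lst | while read -r f; do
    mkdir -p "$A/$(dirname "$f")" && cp -a "$B/$f" "$A/$f"
done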

There exists a ready-made solution (shell script), based on find (also using the same idea as yours), to synchronize two directories: https://github.com/Fitus/Zaloha.sh.
Documentation is here: https://github.com/Fitus/Zaloha.sh/blob/master/DOCUMENTATION.md.
Cheers

Related

How do I get the list of all items in dir1 which don't exist in dir2?

I want to compute the difference between two directories - but not in the sense of diff, i.e. not of file and subdirectory contents, but rather just in terms of the list of items. Thus if the directories have the following files:
dir1: f1 f2 f4
dir2: f2 f3
I want to get f1 and f4.
You can use comm to compare two listings:
comm -23 <(ls dir1) <(ls dir2)
Process substitution with <(cmd) passes the output of cmd as if it were a file name. It's similar to $(cmd), but instead of capturing the output as a string it generates a dynamic file name (usually /dev/fd/###).
comm prints three columns of information: lines unique to file 1, lines unique to file 2, and lines that appear in both. -23 hides the second and third columns and shows only lines unique to file 1.
You could extend this to do a recursive diff using find. If you do that you'll need to suppress the leading directories from the output, which can be done with a couple of strategic cds.
comm -23 <(cd dir1; find) <(cd dir2; find)
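One caveat: comm expects its inputs to be sorted, and find does not guarantee any particular order, so it can be safer to sort the listings explicitly (a small variation on the same idea):
comm -23 <(cd dir1; find . | sort) <(cd dir2; find . | sort)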
Edit: A naive diff-based solution + improvement due to @JohnKugelman:
diff --suppress-common-lines <(\ls dir1) <(\ls dir2) | egrep "^<" | cut -c3-
Instead of comparing the directories themselves, we compare their listings as files using regular diff, keep only the lines that appear in the first listing (which diff marks with <), and finally strip that marking.
Naturally one could beautify the above by checking for errors, verifying we've gotten two arguments, printing usage information otherwise etc.
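For instance, a minimal wrapper along those lines, using the comm form from above (the script name, messages and argument handling are just an illustration):
#!/bin/bash
# List items that exist in the first directory but not in the second.
if [ "$#" -ne 2 ] || [ ! -d "$1" ] || [ ! -d "$2" ]; then
    echo "Usage: $0 DIR1 DIR2" >&2
    exit 1
fi
comm -23 <(ls "$1") <(ls "$2")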

Bash: How to sort files in a directory by number of lines?

I'm new to bash so my knowledge is pretty limited, especially when it comes to writing scripts. Is there any way in which a script can read into the directories it is given and sort the files inside in descending order, from the one with most lines to the one with least?
find /some/dir -type f -print0 | wc -l --files0-from=- | sort -n -r
This should do what you want.
The find program scans the directory /some/dir recursively and outputs the full path of each file it finds (-type f means regular files, as opposed to directories/sockets/etc.). The output list uses NUL-terminated strings (-print0) in order to safely deal with dodgy filenames.
That list of filenames feeds into wc (word count), where --files0-from=- tells it to expect a NUL-terminated file list on standard input; for each file it prints the number of lines (-l) in front of the filename.
That list, in turn, feeds into sort, which sorts it in reverse (-r) numeric (-n) order; since the line count is in front of each filename, the longest file (most lines) ends up on top.
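Wrapped up as a small script it might look like this (a sketch; the argument handling and the sed stage that drops GNU wc's trailing "total" summary line are assumptions you may want to adjust):
#!/bin/bash
# List files under a directory, sorted by line count, longest first.
dir=${1:-/some/dir}
# GNU wc appends a "total" line when counting more than one file; sed '$d'
# drops that last line so the summary does not top the sort. Remove that
# stage if your wc prints no summary (e.g. for a single file).
find "$dir" -type f -print0 |
    wc -l --files0-from=- |
    sed '$d' |
    sort -n -r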

Concatenating multiple fastq files and renaming to parent folder

Currently, I have genome sequencing data as fq.gz files and each sample consists of a folder titled with the sample name containing 2-8 pairs of forward and reverse reads. I would like to concatenate all of the forward and reverse files into one forward and one reverse file for each sample while maintaining pair order.
My data are organized as follows:
/ParentFolder/SampleA/V549_1.fq.gz
/ParentFolder/SampleA/V549_2.fq.gz
/ParentFolder/SampleA/V550_1.fq.gz
/ParentFolder/SampleA/V550_2.fq.gz
/ParentFolder/SampleB/V588_1.fq.gz
/ParentFolder/SampleB/V588_2.fq.gz
/ParentFolder/SampleB/V599_1.fq.gz
/ParentFolder/SampleB/V599_2.fq.gz
In order to concatenate the files, I tried the following:
ls *_1.fq.gz | sort | xargs cat > SampleA_1.fq.gz
ls *_2.fq.gz | sort | xargs cat > SampleA_2.fq.gz
This works for one sample folder, however, since I have many dozens of samples, I would like to write a bash script to accomplish this across all my samples and rename the concatenated files to name of their parent folder. I am still learning but I am a bit lost, I would greatly appreciate any help with this problem.
I have attempted the following, without success:
for i in $(find ./ -type f -name "*.fq.gz"; done | sort | uniq)
do echo "Merging 1"
cat "$i"*_1.fq.gz > "$i"CG1-1_1.fq.gz
Thank you for any input/advice/solutions.
Consider the following logic:
for each suffix (_1, _2):
    find all the fq.gz files with that suffix
    extract the list of folders they live in
    for each folder:
        merge its matching fq.gz files into a new FOLDER_<suffix>.fq.gz in the parent folder
p=pp    # parent folder holding the per-sample directories (e.g. /ParentFolder)
for suffix in 1 2 ; do
    # Find all dirs containing files for this suffix.
    dirs=$(printf '%s\n' $p/*/*_$suffix.fq.gz | sed 's:/[^/]*$::' | uniq)
    for d in $dirs ; do
        # Merge, and store the result in the parent, named after the folder.
        (cd $d ; cat *_${suffix}.fq.gz > ../${d##*/}_${suffix}.fq.gz)
    done
done
Notes:
The code assumes no special characters (e.g. spaces) in the folder names.
More compact files will be created if the merge step decompresses the original data and re-compresses it (e.g. gzcat *.gz | gzip instead of a plain cat).
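If that smaller output matters, the inner merge line of the loop above could be swapped for something like this (a sketch only; zcat is assumed to be available, use gzcat on systems that name it that way):
(cd $d ; zcat *_${suffix}.fq.gz | gzip > ../${d##*/}_${suffix}.fq.gz)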

Sort files based on content

I have around 1000 files from a phylogenetic analysis and each file looks something like this
File 1
(((A:0.03550734102561460548,B:0.04004337325891465377):0.01263892787244691278,C:0.03773369182398536581):0.08345900687345568503,(D:0.04441859961888914438,((E:0.04707945363105774106,F:0.03769496882665739068):0.00478087012691866091,G:0.01269975716918288373):0.01263526019405349088):0.13087200352448438712,H:1.91169780510990117151):0.0;
File 12
((A:0.11176787864288327545,(B:0.18283029119402782747,C:0.12136417416322728413):0.02367730683755531543):0.21101090994668161849,(((F:0.06464548582830945134,E:0.06903977777526745796):0.01710921464740109560,G:0.01577242071367901746):0.00958883666063858192,D:0.03506359820882300193):0.47082738536589324729,H:2.94973933657097164840):0.0;
I want to read the content of each file and classify the files according to their patterns (meaning the file content). The numbers here represent branch lengths and will not be the same in any of the files, so I would like to classify the files based only on the letters A to H: all the files that have the letters A to H arranged in the same order should be sorted into the same folder.
For File 1, ignoring the numbers (branch lengths), the pattern will be something like this:
(((A:,B:),C:):,(D:,((E:,F:):,G:):):,H:):;
And all the files that contain this pattern will go into a folder.
File 1
File 5
File 6
File 10
....
I know how to sort files matching a particular pattern using:
grep -l -Z pattern files | xargs -0 mv -t target-directory --
But I am not sure how to do it in this case, as I do not have prior knowledge of the patterns.
You can get the content patterns and sort them:
$ for f in file{1..2}; do
      printf '%s\t' "$f"
      tr -d '[ 0-9.]' < "$f"
  done | sort -k2
file1 (((A:,B:):,C:):,(D:,((E:,F:):,G:):):,H:):;
file2 ((A:,(B:,C:):):,(((F:,E:):,G:):,D:):,H:):;
Files with the same pattern will appear on consecutive lines. This assumes you have one record per file.
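To actually move the files into one folder per pattern, the same idea could be extended along these lines (a sketch; the file* glob and the use of an md5sum of the pattern as a filesystem-safe folder name are assumptions):
for f in file*; do
    pattern=$(tr -d '[ 0-9.]' < "$f")                        # topology only, branch lengths stripped
    dir=$(printf '%s' "$pattern" | md5sum | cut -d' ' -f1)   # short, safe directory name per pattern
    mkdir -p "$dir"
    mv -- "$f" "$dir"/
done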

How to delete one set of files in a directory containing similarly named files?

A series of several hundred directories contains files in the following pattern:
Dir1:
-text_76.txt
-text_81.txt
-sim_76.py
-sim_81.py
Dir2:
-text_90.txt
-text_01.txt
-sim_90.py
-sim_01.py
Within each directory, the files beginning with text or sim are essentially duplicates of the other text or sim file, respectively. Each set of duplicate files has a unique numerical identifier. I only want one set per directory. So, in Dir1, I would like to delete everything in the set labeled either 81 OR 76, with no preference. Likewise, in Dir2, I would like to delete either the set labeled 90 OR 01. Each directory contains exactly two sets, and there is no way to predict the random numerical IDs used in each directory. How can I do this?
Assuming you always have 1 known file, say text_xx.txt then you could run this script in each sub-directory:
ls text_*.txt | { read first; rm *"${first:4:4}"*; };
This will list all files matching the wildcard pattern text_*.txt. Using read takes only the first matching result of the ls command, so the $first shell variable contains one fully expanded match: text_xx.txt. After that, ${first:4:4} takes the substring _xx. from that match, relying on the known lengths of text_ and xx. Finally, rm *"${first:4:4}"* wraps the substring in wildcards and executes the resulting command, rm *_xx.*.
I chose to include _ and . around xx to be a bit conservative about what gets deleted.
If the length of xx is not known, things get a bit more complicated. A safer command, if unsure of this length, might be:
ls text_??.txt | { read first; rm *_"${first:5:2}".*; };
This should remove one "fileset" every time it is run in a given sub-directory. If there is only 1 fileset, it would still remove the fileset.
Edit: Simplified to remove unnecessary use of IFS command.
Edit: Attempt to expand on and clarify the explanation.
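To apply this across all the sub-directories at once, a loop along these lines could work (a sketch; the Dir*/ glob is an assumption about how the directories are named):
# Remove one of the two file sets in every sub-directory.
for d in Dir*/ ; do
    (
        cd "$d" || exit
        first=$(ls text_*.txt | head -n 1)   # pick one of the two IDs arbitrarily
        id=${first#text_}                    # strip the leading "text_"
        id=${id%.txt}                        # strip the trailing ".txt"
        rm -- *_"$id".*                      # removes text_ID.txt and sim_ID.py
    )
done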
ls | grep -E "(81|76)" | xargs -d "\n" rm
ls | grep -E "(90|01)" | xargs -d "\n" rm
How it works:
ls lists all files (one per line, since its output is piped).
grep -E "(81|76)" keeps only the file names containing one of the two IDs.
xargs -d "\n" rm passes each piped line to rm as a separate argument, so the matched files are removed.
