column-wise merging of multiple files in specific order - bash

I want to merge multiple files column-wise, taking the files in increasing numerical order of their names. Specifically, I have renamed 163 files as 1.lrr, 2.lrr, 3.lrr ... 163.lrr and used the following command to merge them:
paste -d "\t" *.lrr > all_samples.lrr
However, it combined the columns in a strange order of file names. It started with file 100.lrr instead of 1.lrr, then continued with files 101.lrr through 109.lrr. Is it possible to modify this command so that the file names are sorted numerically before the columns are merged?

Try this:
paste $(ls | grep -E '\.lrr$' | sort -n) > all_samples.lrr
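A variant that avoids parsing ls output could look like the sketch below. It assumes bash 4 or later (for mapfile) and file names without newlines; the function name merge_numeric is made up for illustration.

```bash
#!/usr/bin/env bash
# merge_numeric (hypothetical helper): paste the files given as arguments
# side by side, ordered by the leading number in each file name.
merge_numeric() {
    local -a files
    # sort -n compares the leading numeric part, so 2.lrr comes before 10.lrr
    mapfile -t files < <(printf '%s\n' "$@" | sort -n)
    paste "${files[@]}"      # paste separates columns with a tab by default
}

# Usage (the [0-9]* glob keeps all_samples.lrr out of the input on a rerun):
#   merge_numeric [0-9]*.lrr > all_samples.lrr
```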

Related

Is it possible to grep using an array as pattern?

TL;DR
How to filter an ls/find output using grep
with an array as a pattern?
Background story:
I have a pipeline that I have to rerun for datasets that ran into an error.
Which datasets ran into an error is recorded in a tab-separated file.
I want to delete the files where the pipeline has run into an error.
To do so, I extracted the dataset names from another file listing the finished datasets and saved them in a bash array (ds1 ds2 ...), but now I am stuck because I cannot figure out how to exclude the datasets in the array from my deletion step.
This is the folder structure (X=1-30):
datasets/dsX/results/dsX.tsv
Without excluding the finished datasets, that is, deleting the folders of both the failed and the finished datasets, this works like a charm:
#1. move content to a trash folder
ls /datasets/*/results/*|xargs -I '{}' mv '{}' ./trash/
#2. delete the empty folders
find /datasets/*/. -type d -empty -delete
But since I want to exclude the finished datasets I thought it would be clever to save them in an array:
#find finished datasets by extracting the dataset names from a tab separated log file
mapfile -t -s 1 finished < <(awk '{print $2}' "$path/$log_pf")
echo "${finished[@]}"
which works as expected but now I am stuck in filtering the ls output using that array:
*pseudocode
#trying to ignore the dataset in the array - not working
ls -I${finished[@]} -d /datasets/*/
#trying to reverse grep for the finished datasets - not working
ls /datasets/*/ | grep -v {finished}
What do you think about my current ideas?
Is this possible using bash only? I guess in python I could do that easily
but for training purposes, I want to do it in bash.
grep can get the patterns from a file using the -f option. Note that file names containing newlines will cause problems.
If you need to process the input somehow, you can use process substitution:
grep -f <(process the input...)
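Applied to the array from the question, this could be sketched as follows. The sample array contents, the path list and the function name filter_unfinished are assumptions for illustration; the patterns are wrapped in slashes so that ds1 does not also exclude ds10.

```bash
#!/usr/bin/env bash
# Hypothetical sample data standing in for the array read from the log file.
finished=(ds1 ds3)

# filter_unfinished (illustrative name): read paths on stdin and drop every
# path that contains one of the finished dataset names as a directory component.
filter_unfinished() {
    # -F: fixed strings, -v: invert the match, -f: read patterns from a "file"
    # that process substitution builds from the array, one "/name/" per line.
    grep -v -F -f <(printf '/%s/\n' "${finished[@]}")
}

printf '%s\n' /datasets/ds1/ /datasets/ds2/ /datasets/ds3/ /datasets/ds10/ |
    filter_unfinished
# prints:
# /datasets/ds2/
# /datasets/ds10/
```

With a real directory tree, the input could come from something like printf '%s\n' /datasets/*/ instead of the fixed list.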
I must admit I'm confused about what you're doing but if you're just trying to produce a list of files excluding those stored in column 2 of some other file and your file/directory names can't contain spaces then that'd be:
find /datasets -type f | awk 'NR==FNR{a[$2]; next} !($0 in a)' "$path/$log_pf" -
If that's not all you need then please edit your question to clarify your requirements and add concise testable sample input and expected output.

Sort files based on content

I have around 1000 files from a phylogenetic analysis and each file looks something like this
File 1
(((A:0.03550734102561460548,B:0.04004337325891465377):0.01263892787244691278,C:0.03773369182398536581):0.08345900687345568503,(D:0.04441859961888914438,((E:0.04707945363105774106,F:0.03769496882665739068):0.00478087012691866091,G:0.01269975716918288373):0.01263526019405349088):0.13087200352448438712,H:1.91169780510990117151):0.0;
File 12
((A:0.11176787864288327545,(B:0.18283029119402782747,C:0.12136417416322728413):0.02367730683755531543):0.21101090994668161849,(((F:0.06464548582830945134,E:0.06903977777526745796):0.01710921464740109560,G:0.01577242071367901746):0.00958883666063858192,D:0.03506359820882300193):0.47082738536589324729,H:2.94973933657097164840):0.0;
I want to read the content of each file and classify the files according to their patterns (meaning the file content). The numbers here represent branch lengths and will not be the same in any two files. So, I would like to classify the files based on the letters A to H. For instance, I would like to sort all the files that have the letters A to H arranged in the same order into a separate folder. For example:
For File 1, the pattern will be something like this, ignoring the numbers (branch lengths):
(((A:,B:):,C:):,(D:,((E:,F:):,G:):):,H:):;
And all the files that contain this pattern will go into a folder.
File 1
File 5
File 6
File 10
....
I know how to move files that match a known pattern using:
grep -l -Z pattern files | xargs -0 mv -t target-directory --
But I am not sure how to do it in this case, as I have no prior knowledge of the patterns.
You can extract the content patterns and sort them:
$ for f in file{1..2};
do printf '%s\t' "$f"; tr -d ' 0-9.' <"$f";
done |
sort -k2
file1 (((A:,B:):,C:):,(D:,((E:,F:):,G:):):,H:):;
file2 ((A:,(B:,C:):):,(((F:,E:):,G:):,D:):,H:):;
Identical patterns end up on consecutive lines. This assumes you have one record per file.
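To go one step further and actually move the files into one folder per pattern, a sketch along these lines might work. The function name and the topology_<checksum> folder naming are assumptions, and cksum is only used as a short label, so a collision between two different patterns is unlikely but not impossible.

```bash
#!/usr/bin/env bash
# group_by_topology (hypothetical helper): move each given file into a
# directory named after a checksum of its pattern, i.e. the file content
# with digits, dots, spaces and newlines removed.
group_by_topology() {
    local f pattern dir
    for f in "$@"; do
        pattern=$(tr -d '0-9. \n' < "$f")
        # cksum prints "<checksum> <byte count>"; keep only the checksum
        dir="topology_$(printf '%s' "$pattern" | cksum | cut -d' ' -f1)"
        mkdir -p "$dir"
        mv -- "$f" "$dir/"
    done
}

# Usage: group_by_topology file*
```

As in the question, two trees count as the same pattern only if the letters appear in the same order, so (E:,F:) and (F:,E:) land in different folders.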

BASH: Loop cut columns from each csv to new csv

I have a number of .csv files all with the same structure of 22 columns. I only require columns 5,14 and 15 so use:
$ cut -d, -f5,14,15 original.csv > new_original.csv
However, I will soon have a number of CSV files coming in daily, so I need a loop that performs this on each CSV file and adds a prefix such as "new_" to the file name. Alternatively, I don't mind editing the files in place (like -i).
Thanks
You can run the following in the directory that contains the csv files.
for file in *.csv
do
cut -d, -f5,14,15 "$file" > "new_$file"
done
This will loop over each of them, perform the filtering and output to the same name prefixed with new_.
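Since new files arrive daily, the loop will probably be run more than once in the same directory, and on a second run *.csv also matches the new_ files from the first run. A minimal sketch that skips those; the function name extract_columns is an illustrative choice:

```bash
#!/usr/bin/env bash
# extract_columns (hypothetical helper): for every CSV file in the current
# directory, write columns 5, 14 and 15 to a copy prefixed with "new_".
extract_columns() {
    local file
    for file in *.csv; do
        [ -e "$file" ] || continue               # the glob matched nothing
        case $file in new_*) continue ;; esac    # skip output of earlier runs
        cut -d, -f5,14,15 -- "$file" > "new_$file"
    done
}

# Usage: cd into the directory that holds the CSV files, then run
#   extract_columns
```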

Massive rename of files but keep the same sorting

I have a lot of files in a folder with the same extension (e.g. .vtk) and I am using a bash script to mass-rename them with sequential numbers.
Here is the script I use:
n=0;
for file in *.vtk; do
mv "${file}" "100_${n}.vtk";
n=$((n+1));
done
After the script runs, all the files are renamed like:
100_1.vtk
100_2.vtk
.
.
.
My problem is that I want to keep the sorting of the files exactly the same as it was before. For example, if I had two sequential files named something.vtk and something_else.vtk, I want them to correspond to 100_1.vtk and 100_2.vtk respectively after the renaming process.
Can you change your for loop from this:
for file in *.vtk; do
to this:
for file in $(ls -1 *.vtk | sort); do
If your filenames don't contain spaces, this should work.
You can use sort -kX.Y, where X is the field (column) and Y is the character position within that field.
So, something like following should be fine:
$ ls | sort -k1.5
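A related point: with plain counters, 100_10.vtk sorts before 100_2.vtk in a directory listing, so the new names may not list in the order in which they were assigned. Zero-padding the counter avoids that. A minimal sketch, assuming at most 9999 files and that no existing file already has one of the target names; the function name and the padding width are illustrative choices:

```bash
#!/usr/bin/env bash
# rename_sequential (hypothetical helper): rename all .vtk files in the
# current directory to 100_0001.vtk, 100_0002.vtk, ... in glob order.
rename_sequential() {
    local n=1 file
    # The glob is expanded once, before any file is renamed.
    for file in *.vtk; do
        [ -e "$file" ] || continue
        mv -- "$file" "$(printf '100_%04d.vtk' "$n")"
        n=$((n + 1))
    done
}
```

The order of the glob expansion depends on the locale's collation rules, so running the script with LC_ALL=C set may be needed if the order should match a plain byte-wise sort.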

Bash script to recursively traverse directories, compare and sync files

I'm trying to write a bash shell script to sync content on two different paths.
The algorithm I'm striving for consists of the following steps
1. given two full (as opposed to relative) paths
2. recursively compare files (whose filename optionally may have basename and suffix) in corresponding directories of both paths
3. if either corresponding directories or files are not present, then copy each file (from the path with the folder) to the other corresponding folder.
I've figured out steps 1 and 2 which are
OLD_IFS=$IFS
# Split on newlines only, so paths that contain spaces stay intact
IFS=$'\n'
for old_file in $(diff -rq old/ new/ | grep "^Files.*differ$" | sed 's/^Files \(.*\) and .* differ$/\1/')
do
mv "$old_file" "$old_file.old"
done
IFS=$OLD_IFS
Thanks.
I have implemented a similar algorithm in Java, which essentially boils down to this:
Retrieve a listing of directories A and B, e.g. A.lst and B.lst
Create the intersection of both listings (e.g. cat A.lst B.lst | sort | uniq -d). This is the list of files you need to actually compare; you will also have to descend to any directories recursively.
You may want to have a look at the conditional expressions supported by your shell (e.g. for bash) or by the test command. I would also suggest using cmp instead of diff.
Note: you need to consider what the proper action should be when you have a directory on one side and a file on the other with the same name.
Find the files that are only present in A (e.g. cat A.lst B.lst B.lst | sort | uniq -u) and copy them recursively (cp -a) to B.
Similarly, find the files that are only present in B and copy them recursively to A.
EDIT:
I forgot to mention a significant optimization: if you sort the file lists A.lst and B.lst beforehand, you can use comm instead of cat ... | sort | uniq ... to perform the set operations:
Intersection: comm -12 A.sorted.lst B.sorted.lst
Files that exist only in A: comm -23 A.sorted.lst B.sorted.lst
Files that exist only in B: comm -13 A.sorted.lst B.sorted.lst
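The comm-based set operations can be wired to find roughly as follows. This is a sketch under the assumption that file names contain no newlines; the function names are made up for illustration, and LC_ALL=C is set for both sort and comm so that they agree on the ordering.

```bash
#!/usr/bin/env bash
# list_relative (hypothetical helper): print all regular files below the
# directory $1, as paths relative to it, in sorted order.
list_relative() {
    (cd "$1" && find . -type f | LC_ALL=C sort)
}

# only_in_first: files that exist below $1 but not below $2.
only_in_first() {
    LC_ALL=C comm -23 <(list_relative "$1") <(list_relative "$2")
}

# in_both: files that exist below both directories (candidates for cmp).
in_both() {
    LC_ALL=C comm -12 <(list_relative "$1") <(list_relative "$2")
}

# Usage:
#   only_in_first /path/to/A /path/to/B   # copy these from A to B
#   only_in_first /path/to/B /path/to/A   # copy these from B to A
#   in_both /path/to/A /path/to/B         # compare these with cmp
```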
There exists a ready-made solution (shell script), based on find (also using the same idea as yours), to synchronize two directories: https://github.com/Fitus/Zaloha.sh.
Documentation is here: https://github.com/Fitus/Zaloha.sh/blob/master/DOCUMENTATION.md.
Cheers
