Sort directory list by subfolder number - bash

Is there a fast and smart way in bash (maybe with awk/sed/sort?) to sort the result of a find command by the number of subfolders in the path, and then alphabetically?
I mean something like
./a/
./b/
./c/
./a/a/
./a/python-script.py
./a/test.txt
./b/a/
./b/b/
./c/a/
./c/c/
./a/a/a/
./a/a/file.txt
./a/a/t/
...
...
I want to take the output of the find command and see first the filenames in the current folder, then the files in the first level of subfolders, then the files in the second level, and so on (if possible sorted alphabetically within each level).

You can use the -printf action of find and ask it to print the depth of each file with %d. Then use sort on that and cut to remove the depth column:
$ find . -printf '%d\t%p\n' | sort -n | cut -f2-
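The plain `sort -n` above sorts by depth but leaves ties to whatever the rest of the line compares as; if you explicitly want alphabetical order within each level, you can give sort two keys. A small self-contained sketch (GNU find assumed for -printf; the tree below is invented for the demo, and -mindepth 1 just hides the `.` entry):

```shell
# Build a throwaway tree, then sort find output by depth (key 1, numeric)
# and by path (key 2, lexical) within each depth.
tmp=$(mktemp -d)
mkdir -p "$tmp/b/a" "$tmp/a/a"
touch "$tmp/a/test.txt"
out=$(cd "$tmp" && find . -mindepth 1 -printf '%d\t%p\n' | sort -k1,1n -k2,2 | cut -f2-)
printf '%s\n' "$out"
rm -rf "$tmp"
# ./a
# ./b
# ./a/a
# ./a/test.txt
# ./b/a
```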

I suppose this is much less elegant than @kvantour's answer, but how about the Schwartzian transform in Perl:
find . -print0 | perl -0ne '
    push(@list, $_);
    END {
        @sorted = map  { $_->[0] }
                  sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] }
                  map  { [$_, tr#/#/#] } @list;
        print join("\n", @sorted), "\n";
    }'

Related

Find & Merge log files in a directory based on rotation

In a directory, log files are rotated daily by date (FILEX.`date +%F-%H%M`.LOG) and placed in a directory...
I am attempting to de-clutter the directory, since I have too many files, and merge the files by date.
Every day I have 2 files, call them FILE A and B, on different nodes. For example today....
Content is as follow (not actual, but for illustration purpose)
FILEA.2019-07-18-1701.LOG
111AAA
222BBB
FILEB.2019-07-18-1703.LOG
333CCC
444DDD
After merging, FILEA.date.LOG and FILEB.date.LOG are removed/deleted.
Manual way:
cat fileA fileB > FILEC.`date +%F-%H%M`.LOG
I started writing the following code but got stuck on how to proceed: it returns filenames, but I don't know how to pick them by date and merge them.
#!/usr/bin/perl
use strict;
use warnings;
opendir(DIR, "/mydirectory/");
my @files = grep(/\*.*LOG$/, readdir(DIR));
closedir(DIR);
foreach my $file (@files) {
    print "$file\n";
}
Above only prints the files in the directory.
FILEA.2019-07-18-1701.LOG
FILEB.2019-07-18-1703.LOG
more...from older dates.
The print returns all of the logs in my directory. I planned to place them in an array, sort them by date and merge two... but that's where I am stuck on how to proceed with the logic. [Either shell or Perl help will do.]
Expected output after combining the two files...
111AAA
222BBB
333CCC
444DDD
Sorting the files by the date part of the filename can be done using what is called the Schwartzian transform, named after Perl god Randal L. Schwartz, who invented it.
Here is a script that sorts the filenames by date and then prints a suggested command to do with them. I assume you'll be able to adjust the rest to match your needs.
Also, to list files in a directory, it is easiest to use builtin function glob(), and probably most efficient too.
#!/usr/bin/perl
use strict;
use warnings;
my $dir="/mydirectory";
my @files = glob "$dir/FILE[AB]*.LOG";
# Schwartzian transform to sort by the date part of the file name
my @sorted_files =
    # return just the file name:
    map  { $_->[0] }
    # sort by date, then whole file name:
    sort { $a->[1] cmp $b->[1] or $a->[0] cmp $b->[0] }
    # build a pair [filename, date] for each file, with date as "" when none found:
    map  { $_ =~ /(\d{4}-\d{2}-\d{2})/; [$_, $1 || ""] }
    @files;
foreach my $file (@sorted_files) {
    print "$file\n";
    my $outfile = $file;
    # construct your output file name as you need - I'm not sure what you
    # want to do with timestamps since in your example, FILEA and FILEB had
    # different timestamps
    $outfile =~ s/[^\/]*(\d{4}-\d{2}-\d{2}).*/FILEC.$1.LOG/;
    print "cat $file >> $outfile\n";
    # Uncomment this once you're confident it's doing the right thing:
    #system("cat $file >> $outfile");
    #unlink($file); # Not reversible... Safer to clean up by hand instead?
}
Important note: I wrote the glob pattern in such a way that it would not match FILEC*, because otherwise the commented-out lines (`system` and `unlink`) could destroy your logs completely if you uncommented them and ran the script twice.
Of course, you can make all this a lot more concise once you're comfortable with the construct:
#!/usr/bin/perl
use strict;
use warnings;
my @files =
    map  { $_->[0] }
    sort { $a->[1] cmp $b->[1] or $a->[0] cmp $b->[0] }
    map  { $_ =~ /(\d{4}-\d{2}-\d{2})/; [$_, $1 || ""] }
    glob "/mydirectory/FILE[AB]*.LOG";
foreach my $file (@files) {
    ...
}

sorting a multiFASTA file by DNA length

I'm trying to sort a multiFASTA file by length. I have the alphabetical sort figured out but I can't seem to get the numerical sort. The output should be a sorted multiFASTA file. This is an option to another program. Here is the code.
sub sort {
    my $length;
    my $key;
    my $id;
    my %seqs;
    my $seq;
    my $action = shift;
    my $match = $opts{$action};
    $match =~ /[l|id]/ || die "not the right parameters\n";
    my $in = Bio::SeqIO->new(-file=>"$filename", -format=>'fasta');
    while (my $seqobj = $in->next_seq()) {
        my $id = $seqobj->display_id();
        my $length = $seqobj->length();
        #$seq =~ s/.{1,60}\K/\n/sg;
        $seqs{$id} = $seqobj unless $match eq 'l';
        $seqs{$length} = $seqobj unless $match eq 'id';
    }
    if ($match eq 'id') {
        foreach my $id (sort keys %seqs) {
            printf ">%-9s \n%-s\n", $id, $seqs{$id}->seq;
        }
    }
    elsif ($match eq 'l') {
        foreach my $length (sort keys %seqs) {
            printf "%-10s\n%-s\n", $length, $seqs{$length}->seq;
        }
    }
}
To sort numerically, you must provide a comparison function:
sort { $a <=> $b } keys %seqs
Are you sure no two sequences can have the same length? $seqs{$length}=$seqobj overwrites the previously stored value.
A one-liner: use awk to linearize, a second awk to add a column containing the length, sort on this column, remove the column, and restore the FASTA sequence.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
sort -t $'\t' -k1,1n |\
cut -f 2- |\
tr "\t" "\n"
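A quick way to sanity-check that pipeline, using a made-up three-record FASTA file in a temp directory (records of lengths 8, 2 and 5 should come out shortest first):

```shell
tmp=$(mktemp -d)
printf '>seq1\nACGTACGT\n>seq2\nAC\n>seq3\nACGTA\n' > "$tmp/input.fa"
out=$(awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' "$tmp/input.fa" |
    awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |
    sort -k1,1n |
    cut -f 2- |
    tr "\t" "\n")
printf '%s\n' "$out"
rm -rf "$tmp"
# >seq2
# AC
# >seq3
# ACGTA
# >seq1
# ACGTACGT
```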
PS: for bioinformatics questions, you should use https://www.biostars.org/, or https://bioinformatics.stackexchange.com/, etc...
You can use pyfaidx or just take a look at jim hester's repos. But as @pierre said above, you should ask your question on Biostars, for example. The answer on Biostars can be found here.

Sorting issue in Bash Script

I have a whole file full of filenames that is outputted from the find command below:
find "$ARCHIVE" -type f -name '*_[0-9][0-9]' | sed 's/_[0-9][0-9]$//' > temp
I am now trying to sort these file names and count them to find out which one appears the most. The problem I am having with this is whenever I execute:
sort -g temp
It prints all the sorted file names to the command line and I am unsure why. Any help with this issue would be greatly appreciated!
You may need this:
sort temp| uniq -c | sort -nr
First we sort temp, then prefix each line with its number of occurrences (uniq -c), and finally sort numerically (-n) in reverse order (-r), so the most frequent filename comes first.
Example file:
/home/user/testfiles/405/prob405823
/home/user/testfiles/405/prob405823
/home/user/testfiles/527/prob527149
/home/user/testfiles/518/prob518433
Output:
2 /home/user/testfiles/405/prob405823
1 /home/user/testfiles/527/prob527149
etc..
Resources:
Linux / Unix Command: sort
uniq(1) - Linux man page
ptierno - comments to improve answer
You could do everything after the find in one awk command (this one uses GNU awk 4.*):
find "$ARCHIVE" -type f -name '*_[0-9][0-9]' |
awk '
    { cnt[gensub(/_[0-9][0-9]$/,"",1)]++ }
    END {
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (file in cnt) {
            print cnt[file], file
        }
    }
'

Sort files by basename

After a find, I'd like to sort the output by the basename (the number of directories is unknown). I know this can be done by splitting the basename from the dirname and sorting that, but I'm specifically looking for something where it's not necessary to modify the data before the sort. Something like sort --field-separator='/' -k '-1'.
For this task, I'd turn to perl and the use of a custom sort function. Save the perl code below as basename_sort.pl, chmod it 0755, then you can execute a command such as you've requested, as:
find | grep "\.php" | ./basename_sort.pl
Of course, you'll want to move that utility somewhere if you're doing it very often. Better yet, I'd also recommend wrapping a function around it within your .bashrc file. (staying on topic, sh code for that not included)
#!/usr/bin/perl
use strict;

my @lines = <STDIN>;
@lines = sort basename_sort @lines;
foreach (@lines) {
    print $_;
}

sub basename_sort {
    my @data1 = split('/', $a);
    my @data2 = split('/', $b);
    my $name1 = $data1[@data1 - 1];
    my $name2 = $data2[@data2 - 1];
    return lc($name1) cmp lc($name2);
}
This can be written shorter.
find | perl -e 'print sort{($p=$a)=~s!.*/!!;($q=$b)=~s!.*/!!;$p cmp$q}<>'
Ended up with a solution of simply moving the base name to the start of the string, sorting, and moving it back. Not really what I was hoping for, but it works even with weirdzo file names.

Bash and sort files in order

with a previous bash script I created a list of files:
data_1_box
data_2_box
...
data_10_box
...
data_99_box
the thing is that now I need to concatenate them, so I tried
ls -l data_*
but I get
.....
data_89_box
data_8_box
data_90_box
...
data_99_box
data_9_box
but I need to get them in the succession 1, 2, 3, 4, ..., 9, ..., 89, 90, 91, ..., 99.
Can it be done in bash?
ls data_* | sort -n -t _ -k 2
-n: sorts numerically
-t: field separator '_'
-k: sort on second field, in your case the numbers after the first '_'
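You can check the behaviour without creating any files by feeding sample names straight to sort (the names are invented for the demo):

```shell
# -t _ splits on underscores; -k 2 sorts from the second field on,
# and -n makes that comparison numeric, so 2 sorts before 10.
out=$(printf '%s\n' data_10_box data_2_box data_99_box data_1_box | sort -n -t _ -k 2)
printf '%s\n' "$out"
# data_1_box
# data_2_box
# data_10_box
# data_99_box
```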
How about using the -v flag to ls? The purpose of the flag is to sort files according to version number, but it works just as well here and eliminates the need to pipe the result to sort:
ls -lv data_*
If your sort has version sort, try:
ls -1 | sort -V
(that's a capital V).
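The same demo with version sort, which needs no field separator at all (GNU sort assumed; the sample names are invented):

```shell
# -V treats each run of digits as a number, so data_2_box < data_10_box.
out=$(printf '%s\n' data_10_box data_2_box data_1_box | sort -V)
printf '%s\n' "$out"
# data_1_box
# data_2_box
# data_10_box
```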
This is a generic answer! You have to apply the rules to your specific set of data.
ls | sort
Example:
ls | sort -n -t _ -k 2
maybe you'll like SistemaNumeri.py ("fix numbers"): it renames your
data_1_box
data_2_box
...
data_10_box
...
data_99_box
in
data_01_box
data_02_box
...
data_10_box
...
data_99_box
Here's the way to do it in bash if your sort doesn't have version sort:
cat <your_former_ls_output_file> | awk ' BEGIN { FS="_" } { printf( "%03d\n",$2) }' | sort | awk ' { printf( "data_%d_box\n", $1) }'
All in one line. Keep in mind, I haven't tested this on your specific data, so it might need a little tweaking to work correctly for you. This outlines a good, robust and relatively simple solution, though. Of course, you can always swap the cat+filename at the beginning with the actual ls to create the file data on the fly. For capturing the actual filename column, you can choose between the correct ls parameters or piping through either cut or awk.
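The pad-sort-rebuild idea in the pipeline above, demonstrated on a few literal names instead of a file, so nothing here touches your data:

```shell
# Zero-pad the number, sort lexically, then strip the padding back off.
out=$(printf '%s\n' data_10_box data_2_box data_1_box |
    awk 'BEGIN { FS="_" } { printf("%03d\n", $2) }' |
    sort |
    awk '{ printf("data_%d_box\n", $1) }')
printf '%s\n' "$out"
# data_1_box
# data_2_box
# data_10_box
```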
One suggestion I can think of is this :
for i in `seq 1 5`
do
    cat "data_${i}_box"
done
I have files in a folder and need to sort them based on the number. E.g. -
abc_dr-1.txt
hg_io-5.txt
kls_er_we-3.txt
sd-4.txt
sl_rt_we_yh-2.txt
I need to sort them based on number.
So I used this to sort.
ls -1 | sort -t '-' -nk2