Sort files by basename

After a find, I'd like to sort the output by the basename (the number of directories is unknown). I know this can be done by splitting the basename from the dirname and sorting that, but I'm specifically looking for something where it's not necessary to modify the data before the sort. Something like sort --field-separator='/' -k '-1'.

For this task, I'd turn to Perl and a custom sort function. Save the Perl code below as basename_sort.pl, chmod it 0755, and then you can run a command like the one you've described:
find | grep "\.php" | ./basename_sort.pl
Of course, you'll want to move that utility somewhere on your PATH if you're doing this very often. Better yet, wrap a function around it in your .bashrc file; a rough sketch of that follows the script.
#!/usr/bin/perl
use strict;
my @lines = <STDIN>;
@lines = sort basename_sort @lines;
foreach (@lines) {
    print $_;
}

sub basename_sort {
    my @data1 = split('/', $a);
    my @data2 = split('/', $b);
    my $name1 = $data1[@data1 - 1];
    my $name2 = $data2[@data2 - 1];
    return lc($name1) cmp lc($name2);
}
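For instance, the .bashrc wrapper mentioned above could look something like this (the function name and install path are placeholders of my own, not part of the original answer):
# Hypothetical ~/.bashrc wrapper; assumes basename_sort.pl was moved to ~/bin.
# It reads a list of paths on stdin and prints them sorted by basename.
sort_by_basename() {
    ~/bin/basename_sort.pl
}
# usage, mirroring the pipeline above:
#   find | grep "\.php" | sort_by_basename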

The same thing can be written more concisely as a one-liner:
find | perl -e 'print sort{($p=$a)=~s!.*/!!;($q=$b)=~s!.*/!!;$p cmp$q}<>'

Ended up with a solution of simply moving the base name to the start of the string, sorting, and moving it back. Not really what I was hoping for, but it works even with weirdzo file names.
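For reference, that decorate-sort-undecorate approach might look roughly like this (my own sketch, not the OP's actual command; it assumes the paths contain no tab or newline characters):
find . -name '*.php' |
  awk -F/ '{ print $NF "\t" $0 }' |   # prepend the basename as a sort key
  sort -f -t $'\t' -k1,1 |            # sort on that key only, case-insensitively
  cut -f2-                            # drop the key again, keeping the original path
awk prepends each line's basename plus a tab, sort orders on that first tab-separated field only, and cut strips the key back off.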

Related

Find & Merge log files in a directory based on rotation

In a directory, log files are rotated daily by date (named like FILEX.`date +%F-%H%M`.LOG) and placed in a directory...
I am attempting to de-clutter the directory, since I have too many files, and merge the files by date.
Every day I have 2 files, call them FILE A and B, on different nodes. For example, today....
Content is as follows (not actual, but for illustration purposes):
FILEA.2019-07-18-1701.LOG
111AAA
222BBB
FILEB.2019-07-18-1703.LOG
333CCC
444DDD
After merging, FILEA.date.LOG and FILEB.date.LOG are removed/deleted.
Manual way:
cat fileA fileB > FILEC.`date +%F-%H%M`.LOG
I started writing the following code but am stuck on how to proceed: it returns the filenames, but I don't know how to pick them by date and merge them.
#!/usr/bin/perl
use strict;
use warnings;
opendir(DIR, "/mydirectory/");
my @files = grep(/.*LOG$/, readdir(DIR));
closedir(DIR);
foreach my $file (@files) {
    print "$file\n";
}
Above only prints the files in the directory.
FILEA.2019-07-18-1701.LOG
FILEB.2019-07-18-1703.LOG
more...from older dates.
The print returns everything in my logs directory. I planned to place them in an array, sort them by date and merge two of them... but that's where I am stuck on how to proceed with the logic... [either shell or perl help will do]
Expected output after combining the two files...
111AAA
222BBB
333CCC
444DDD
Sorting the files by the date part of the filename can be done using what is called the Schwartzian transform, named after Perl god Randal L. Schwartz who invented it.
Here is a script that sorts the filenames by date and then prints a suggested command to do with them. I assume you'll be able to adjust the rest to match your needs.
Also, to list files in a directory, it is easiest to use the builtin function glob(), and probably most efficient too.
#!/usr/bin/perl
use strict;
use warnings;
my $dir="/mydirectory";
my @files = glob "$dir/FILE[AB]*.LOG";
# Schwartzian transform to sort by the date part of the file name
my @sorted_files =
    # return just the file name:
    map { $_->[0] }
    # sort by date, then whole file name:
    sort { $a->[1] cmp $b->[1] or $a->[0] cmp $b->[0] }
    # build a pair [filename, date] for each file, with date as "" when none found:
    map { $_ =~ /(\d{4}-\d{2}-\d{2})/; [$_, $1 || ""] }
    @files;
foreach my $file (@sorted_files) {
    print "$file\n";
    my $outfile = $file;
    # construct your output file name as you need - I'm not sure what you
    # want to do with timestamps since in your example, FILEA and FILEB had
    # different timestamps
    $outfile =~ s/[^\/]*(\d{4}-\d{2}-\d{2}).*/FILEC.$1.LOG/;
    print "cat $file >> $outfile\n";
    # Uncomment this once you're confident it's doing the right thing:
    #system("cat $file >> $outfile");
    #unlink($file); # Not reversible... Safer to clean up by hand instead?
}
Important note: I wrote the glob pattern in such a way that it would not match FILEC*, because otherwise the commented-out lines (system and unlink) could destroy your logs completely if you uncommented them and ran the script twice.
Of course, you can make all this a lot more concise once you're comfortable with the construct:
#!/usr/bin/perl
use strict;
use warnings;
my @files =
    map { $_->[0] }
    sort { $a->[1] cmp $b->[1] or $a->[0] cmp $b->[0] }
    map { $_ =~ /(\d{4}-\d{2}-\d{2})/; [$_, $1 || ""] }
    glob "/mydirectory/FILE[AB]*.LOG";
foreach my $file (@files) {
    ...
}

bash: transform scaffold fasta

I have a fasta file with the following sequences:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
And I want to split all sequences in the file that contain multiple Ns, at the site where they occur, and make two sequences out of each.
Expected solution:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGC
>contig1
AAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
my (inelegant) approach would be this:
perl -pe 's/[N]+/\*/g' $file | perl -pe 's/\*/\n>contig1\n/g'
Of course, that also replaces the Ns in the sequence headers and creates headers without a sequence. As a plus, it would be nice to number the new 'contigs' from 1 to x in case there are multiple sequences with Ns.
What do you suggest?
I'd suggest using split instead of trying to get a regex just right, and doing it in a script instead of a brittle and crammed "one"-liner.
use warnings;
use strict;
use feature 'say';
my $file = shift @ARGV;
die "Usage: $0 filename\n" if !$file; # also check submitted $file
my $content = do { # or: my $content = Path::Tiny::path($file)->slurp;
    local $/;
    open my $fh, '<', $file or die "Can't open $file: $!";
    <$fh>;
};
my @f = grep { /\S/ } split /(?<!>)NN+/, $content;
say shift @f;
my $cnt;
for (@f) {
    say "\n>contig", (++$cnt), ":\n$_";
}
This slurps the file into $content since NN+ can span multiple lines; the Path::Tiny module can make that cleaner. The first element of the resulting array needs no >contig, so it is shifted away.
The negative lookbehind (?<!...) makes the regex in split's separator pattern match NN+ only when not preceded by >, thus protecting (excluding) header lines that may start with that. If headers may contain consecutive Ns that are not right after >, then you need to refine this.
I expanded your perl one-liner a bit:
cat file.fasta | \
perl -pe 's/\n//g unless /^>/; s/>/\n>/g;' | \
perl -pe 's/N+(?{$n++})/\n>contig${n}\n/g unless /^>/'
The first part removes the newlines between bases; the second part replaces continuous runs of 'N'.

Using sed on text files with a csv

I've been trying to do bulk find and replace on two text files using a csv. I've seen the questions that SO suggests, and none seem to answer my question.
I've created two variables for the two text files I want to modify. The csv has two columns and hundreds of rows. The first column contains strings (none have whitespaces) already in the text file that need to be replaced with the corresponding strings in same row in the second column.
As a test, I tried the script
#!/bin/bash
test1='long_file_name.txt'
find='string1'
replace='string2'
sed -e "s/$find/$replace/g" $test1 > $test1.tmp && mv $test1.tmp $test1
This was successful, except that I need to do it once for every row in the csv, using the values given by the csv in each row. My hunch is that my while loop is used wrongly, but I can't find the error. When I execute the script below, I just get the command-line prompt back, which makes me think that something has happened. When I check the text files, though, nothing's changed.
The two text files, this script, and the csv are all in the same folder (it's also been my working directory when I do this).
#!/bin/bash
textfile1='long_file_name1.txt'
textfile2='long_file_name2.txt'
while IFS=, read f1 f2
do
    sed -e "s/$f1/$f2/g" $textfile1 > $textfile1.tmp && \
        mv $textfile1.tmp $textfile1
    sed -e "s/$f1/$f2/g" $textfile2 > $textfile2.tmp && \
        mv $textfile2.tmp $textfile2
done <'findreplace.csv'
It seems to me that this code should do what I want it to do (but doesn't); perhaps I'm misunderstanding something fundamental (I'm new to bash scripting)?
The csv looks like this, but with hundreds of rows. All a_i's should be replaced with their counterpart b_i in the next column over.
a_1 b_1
a_2 b_2
a_3 b_3
Something to note: All the strings actually contain underscores, just in case this affects something. I've tried wrapping the variable name in braces a la ${var}, but it still doesn't work.
I appreciate the solutions, but I'm also curious to know why the above doesn't work. (Also, I would vote everyone up, but I lack the reputation to do so. However, know that I appreciate and am learning a lot from your answers!)
If you are going to process a lot of data and your patterns can contain special characters, I would consider using Perl, especially if you are going to have a lot of pairs in findreplace.csv. You can use the following script as a filter or for in-place modification of a lot of files. As a side effect, it will load the replacements and create the Aho-Corasick automaton only once per invocation, which makes this solution pretty efficient (O(M+N) instead of O(M*N) in your solution).
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $in_place = ( @ARGV and $ARGV[0] =~ /^-i(.*)/ )
    ? do {
        shift;
        my $backup_extension = $1;
        my $backup_name = $backup_extension =~ /\*/
            ? sub { ( my $fn = $backup_extension ) =~ s/\*/$_[0]/; $fn }
            : sub { shift . $backup_extension };
        my $oldargv = '-';
        sub {
            if ( $ARGV ne $oldargv ) {
                rename( $ARGV, $backup_name->($ARGV) );
                open( ARGVOUT, '>', $ARGV );
                select(ARGVOUT);
                $oldargv = $ARGV;
            }
        };
    }
    : sub { };
die "$0: File with replacements required." unless @ARGV;
my ( $re, %replace );
do {
    my $filename = shift;
    open my $fh, '<', $filename;
    %replace = map { chomp; split ',', $_, 2 } <$fh>;
    close $fh;
    $re = join '|', map quotemeta, keys %replace;
    $re = qr/($re)/;
};
while (<>) {
    $in_place->();
    s/$re/$replace{$1}/g;
}
continue { print }
Usage:
./replace.pl replace.csv <file.in >file.out
as well as
./replace.pl replace.csv file.in >file.out
or in-place
./replace.pl -i replace.csv file1.csv file2.csv file3.csv
or with backup
./replace.pl -i.orig replace.csv file1.csv file2.csv file3.csv
or with backup with a placeholder
./replace.pl -ithere.is.\*.original replace.csv file1.csv file2.csv file3.csv
You can convert your CSV file to a sed script with the following command:
cat replace.csv | awk -F, '{print "s/" $1 "/" $2 "/g";}' > sed.script
And then you will be able to do a one pass replacement:
sed -i -f sed.script longfilename.txt
This will be a faster implementation of what you want to do.
BTW, sorry, but I do not understand what is wrong with your script, which should work unless your CSV file has more than 2 columns.

Substitution only on last matching line (perl one-liner)

I have multiple files of the form
version 'aaa'
other 'bbb'
another 'ccc'
version 'ddd'
onemore 'eee'
Some have one version, others have multiple; same with the other keys, but the values never repeat. I’m using, as part of a bigger bash function, a perl one-liner to modify values
modify_value() {
    key_to_modify="$1"
    new_value="$2"
    perl -i'' -pe "s|^(\s*)${key_to_modify} .*|\1${key_to_modify} ${new_value}|" "${file}"
}
The indentation on the lines varies and is unpredictable, but should be respected (hence the need for ^(\s*)). This function works great to an extent. I can do
modify_value "onemore" "fff"
And it will be correctly replaced in the text file. However, where it breaks down is when I have multiple keys with the same name (such as the aforementioned version), as the change will be made to all of them. In my particular case, I always want the modification made to the last occurrence.
Since values are never repeated, so far what I have is
modify_value() {
    key_to_modify="$1"
    new_value="$2"
    last_key=$(cat "${file}" | grep "^\s*${key_to_modify}" | tail -1 | perl -pe 's/^\s*//')
    perl -i'' -pe "s|^(\s*)${last_key}|\1${key_to_modify} ${new_value}|" "${file}"
}
This works, but is a bit inelegant. Would it be possible to leverage the perl one-liner to act only on the latest occurrence of the match, instead?
You might be tempted to use Tie::File.
# Borodin's solution with the bug fixes I mention below.
perl -MTie::File -e'
    $key  = shift(@ARGV);
    $val  = shift(@ARGV);
    $file = shift(@ARGV);
    tie @f, "Tie::File", $file;
    for (reverse @f) { last if s/^\s*\Q$key\E\s\K.*/$val/; }
' "$1" "$2" "$file"
For small files, Tie::File will provide a solution that's slower than the alternatives and that uses more memory than the alternatives.
For large files, Tie::File will provide an abysmally slow solution to this problem, although it will use less memory than loading the entire file into memory.
You really can't do any worse than using Tie::File for this problem.
Here's an alternative:
perl -i -e'
    $key = shift(@ARGV);
    $val = shift(@ARGV);
    my @f = reverse(<>);
    for (@f) { last if s/^\s*\Q$key\E\s\K.*/$val/; }
    print reverse(@f);
' "$1" "$2" "$file"
You could even avoid the double reversing by having the substitution operator find the last match.
# 5.14+
perl -0777 -i -e'
    $key = shift(@ARGV);
    $val = shift(@ARGV);
    print <> =~ s/\A.*^\s*\Q$key\E\s\K[^\n]*/$val/smr;
' "$1" "$2" "$file"
or
perl -0777 -i -e'
    $key = shift(@ARGV);
    $val = shift(@ARGV);
    $_ = <>;
    s/\A.*^\s*\Q$key\E\s\K[^\n]*/$val/sm;
    print;
' "$1" "$2" "$file"
or
perl -0777 -i -pe'
    BEGIN {
        $key = shift(@ARGV);
        $val = shift(@ARGV);
    }
    s/\A.*^\s*\Q$key\E\s\K[^\n]*/$val/sm;
' "$1" "$2" "$file"
If memory is an issue, reverse the input using File::ReadBackwards (or a similarly efficient tool), change the first match, then reverse the output using File::ReadBackwards.
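As a rough sketch of that reverse/edit/reverse idea, here is one way it could look using GNU tac in place of File::ReadBackwards (a stand-in of my own choosing, not the module the answer names), reusing the variable names from the question:
# Sketch only: tac stands in for File::ReadBackwards.
tac "$file" | perl -e '
    my ($key, $val) = (shift, shift);   # key and replacement value passed as arguments
    my $done;
    while (<STDIN>) {
        # change only the first match seen here, i.e. the last one in the original file
        s/^\s*\Q$key\E\s\K.*/$val/ and $done = 1 unless $done;
        print;
    }
' "$key_to_modify" "$new_value" | tac > "${file}.tmp" && mv "${file}.tmp" "${file}"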
These solutions also fix the improper interpolation of $key_to_modify and $new_value into the Perl program (by passing the values as args).
These solutions also fix the improper interpolation of $key_to_modify into the regex (by using \Q).
I suggest you use Tie::File, which lets you access a file as an array of lines. Any modifications made to the array are reflected in the file. It has been a core module since Perl 5.8, so it shouldn't need to be installed.
This one-liner works by checking each line of the file from the end to the beginning, and stopping as soon as a match is found. It looks okay, but I'm not in a position to test it at present.
perl -MTie::File -e"tie @f,'Tie::File',\"${file}\"; s/^\s*${key_to_modify}\s\K.*/${new_value}/ and last for reverse @f"

How can I sort after using a delimiter on the last field in bash scripting

For example, suppose that from a command, let's call it "previous", we get a result; this result contains lines of text.
Now, before printing out this text, I want to use the sort command to sort it using a delimiter.
In this case the delimiter is "*".
The thing is, I always want to sort on the last field. For example, if a line looks like this:
text*text***text*********text..*numberstext
I want my sort to sort using the last field, in this case on numberstext.
If all lines were like the line I just posted, then it would be easy:
I could just count the fields that are created when using the delimiter (suppose we have N fields) and then apply this command:
previous command | sort -t * -k N -n
But not all lines are in the same form; some lines can look like this:
text:::***:*numberstext
As you can see, I always want to sort using the last field.
Basically, I'm looking for a method to find the last field when using the character * as a delimiter.
I was thinking that it might be something like
previous command | sort -t * -k $some_variable_denoting_the_ammount_of_fields -n
but I'm not sure if there's anything like that.
Thanks :)
Use sed to duplicate the final field at the start of the line, sort, then use sed to remove the duplicate. Probably simpler to use your favourite programming language though.
Here is a perl script for it:
#!/usr/bin/perl
use strict;
use warnings;
my $regex = qr/\*([^*]*)$/o;
sub bylast
{
    my $ak = ($a =~ $regex, $1) || "";
    my $bk = ($b =~ $regex, $1) || "";
    $ak cmp $bk;
}
print for sort bylast (<>);
This might work:
sed -r 's/.*\*([^*]+$)/\1###&/' source | sort | sed 's/^.*###//'
Add the last field to the front, sort it, then delete the sort key. N.B. ### can be anything you like as long as it does not exist in the source file.
Credit should go to @Havenless; this is just his idea put into code.
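Since the question asks for a numeric sort on that last field, presumably the middle step only needs a -n (my tweak, not part of the original answer):
sed -r 's/.*\*([^*]+$)/\1###&/' source | sort -n | sed 's/^.*###//'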
