Find & Merge log files in a directory based on rotation - shell

I have a directory of log files that are rotated daily by date, named as FILEX.`date +%F-%H%M`.LOG, and placed in a directory...
I am attempting to de-clutter the directory since I have too many files, and to merge the files by date.
Every day I have 2 files, call them FILE A and B, on different nodes. For example, today....
Content is as follows (not actual, just for illustration purposes):
FILEA.2019-07-18-1701.LOG
111AAA
222BBB
FILEB.2019-07-18-1703.LOG
333CCC
444DDD
After merging, FILEAdate.LOG and FILEBdate.LOG are removed/deleted.
Manual way:
cat fileA fileB > FILEC.`date +%F-%H%M`.LOG
I started writing the following code but got stuck on how to proceed: it returns the filenames, but I don't know how to pick them by date and merge them.
#!/usr/bin/perl
use strict;
use warnings;
opendir(DIR, "/mydirectory/");
my @files = grep(/.*LOG$/, readdir(DIR));
closedir(DIR);
foreach my $file (@files) {
print "$file\n";
}
Above only prints the files in the directory.
FILEA.2019-07-18-1701.LOG
FILEB.2019-07-18-1703.LOG
more...from older dates.
The print returns all the logs in my directory. I planned to place them in an array, sort them by date and merge two... but that's where I am stuck on how to proceed with the logic... [either shell or perl help will do]
Expected output after combining the two files...
111AAA
222BBB
333CCC
444DDD

Sorting the files by the date part of the filename can be done using what is called the Schwartzian transform, named after Perl god Randal L. Schwartz who invented it.
Here is a script that sorts the filenames by date and then prints a suggested command to run on them. I assume you'll be able to adjust the rest to match your needs.
Also, to list files in a directory, it is easiest to use the builtin function glob(), and probably most efficient too.
#!/usr/bin/perl
use strict;
use warnings;
my $dir="/mydirectory";
my @files = glob "$dir/FILE[AB]*.LOG";
# Schwartzian transform to sort by the date part of the file name
my @sorted_files =
# return just the file name:
map { $_->[0] }
# sort by date, then whole file name:
sort { $a->[1] cmp $b->[1] or $a->[0] cmp $b->[0] }
# build a pair [filename, date] for each file, with date as "" when none found:
map { $_ =~ /(\d{4}-\d{2}-\d{2})/; [$_, $1 || ""] }
@files;
foreach my $file (@sorted_files) {
print "$file\n";
my $outfile = $file;
# construct your output file name as you need - I'm not sure what you
# want to do with timestamps since in your example, FILEA and FILEB had
# different timestamps
$outfile =~ s/[^\/]*(\d{4}-\d{2}-\d{2}).*/FILEC.$1.LOG/;
print "cat $file >> $outfile\n";
# Uncomment this once you're confident it's doing the right thing:
#system("cat $file >> $outfile");
#unlink($file); # Not reversible... Safer to clean up by hand instead?
}
Important note: I wrote the glob patterns in such a way that they would not match FILEC*, because otherwise the commented-out lines (`system` and `unlink`) could destroy your logs completely if you uncommented them and ran the script twice.
Of course, you can make all this a lot more concise once you're comfortable with the construct:
#!/usr/bin/perl
use strict;
use warnings;
my @files =
map { $_->[0] }
sort { $a->[1] cmp $b->[1] or $a->[0] cmp $b->[0] }
map { $_ =~ /(\d{4}-\d{2}-\d{2})/; [$_, $1 || ""] }
glob "/mydirectory/FILE[AB]*.LOG";
foreach my $file (@files) {
...
}
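Since you mentioned that shell help would do too, here is a rough shell sketch of the same idea: collect the distinct date parts, then concatenate each day's FILEA/FILEB pair into a FILEC file. The directory and naming are assumptions taken from your example, and the rm is left commented out for the same safety reason as above.
#!/bin/bash
cd /mydirectory || exit 1
# Collect the distinct YYYY-MM-DD parts present in the FILEA*/FILEB* names.
dates=$(ls FILE[AB].*.LOG 2>/dev/null |
        sed 's/.*\.\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)-.*/\1/' |
        sort -u)
for d in $dates; do
    out="FILEC.$d.LOG"
    # Append every FILEA/FILEB file for that date into one FILEC file.
    # The FILE[AB] pattern deliberately excludes FILEC, like the glob above.
    cat FILE[AB]."$d"-*.LOG >> "$out"
    # Only delete the inputs once you trust the merge:
    # rm FILE[AB]."$d"-*.LOG
done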

Related

bash: transform scaffold fasta

I have a fasta file with the following sequences:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
And I want to split all the sequences in the file that contain multiple N's at the site where the run occurs and make two sequences out of each.
Expected solution:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGC
>contig1
AAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
my (inelegant) approach would be this:
perl -pe 's/[N]+/\*/g' $file | perl -pe 's/\*/\n>contig1\n/g'
Of course that also replaces the N's in the sequence header and creates headers without a sequence. As a plus, it would be nice to number the new 'contigs' from 1 to x in case there are multiple sequences with N's.
What do you suggest?
I'd suggest using split instead of trying to get a regex just right, and a script instead of a brittle and crammed "one"-liner.
use warnings;
use strict;
use feature 'say';
my $file = shift @ARGV;
die "Usage: $0 filename\n" if !$file; # also check submitted $file
my $content = do { # or: my $content = Path::Tiny::path($file)->slurp;
local $/;
open my $fh, '<', $file or die "Can't open $file: $!";
<$fh>;
};
my @f = grep { /\S/ } split /(?<!>)NN+/, $content;
say shift @f;
my $cnt;
for (@f) {
say "\n>contig", (++$cnt), ":\n$_";
}
This slurps the file into $content since NN+ can span multiple lines; the Path::Tiny module can make that cleaner. The first element of the resulting array needs no >contig header, so it is shifted away.
The negative lookbehind (?<!...) makes the regex in split's separator pattern match NN+ only when not preceded by >, thus protecting (excluding) header lines that may start with that. If headers may contain consecutive N which are not right after > then you need to refine this.
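To see the lookbehind in action on a toy string (purely illustrative, not part of the script above):
perl -E 'say for grep { /\S/ } split /(?<!>)NN+/, ">NNAAA\nCCCNNNNNGGG"'
>NNAAA
CCC
GGG
The leading NN in the header survives because it is preceded by >, while the run inside the sequence is used as the split point.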
I expanded your perl one-liner a bit:
cat file.fasta | \
perl -pe 's/\n//g unless /^>/; s/>/\n>/g;' | \
perl -pe 's/N+(?{$n++})/\n>contig${n}\n/g unless /^>/'
The first part removes the newlines between bases; the second part replaces each continuous run of 'N'.

Using sed on text files with a csv

I've been trying to do bulk find and replace on two text files using a csv. I've seen the questions that SO suggests, and none seem to answer my question.
I've created two variables for the two text files I want to modify. The csv has two columns and hundreds of rows. The first column contains strings (none have whitespaces) already in the text file that need to be replaced with the corresponding strings in same row in the second column.
As a test, I tried the script
#!/bin/bash
test1='long_file_name.txt'
find='string1'
replace='string2'
sed -e "s/$find/$replace/g" $test1 > $test1.tmp && mv $test1.tmp $test1
This was successful, except that I need to do it once for every row in the csv, using the values given by the csv in each row. My hunch is that my while loop was used wrongly, but I can't find the error. When I execute the script below, I get the command line prompt, which makes me think that something has happened. When I check the text files, nothing's changed.
The two text files, this script, and the csv are all in the same folder (it's also been my working directory when I do this).
#!/bin/bash
textfile1='long_file_name1.txt'
textfile2='long_file_name2.txt'
while IFS=, read f1 f2
do
sed -e "s/$f1/$f2/g" $textfile1 > $textfile1.tmp && \
mv $textfile1.tmp $textfile1
sed -e "s/$f1/$f2/g" $textfile2 > $textfile2.tmp && \
mv $textfile2.tmp $textfile2
done <'findreplace.csv'
It seems to me that this code should do what I want it to do (but doesn't); perhaps I'm misunderstanding something fundamental (I'm new to bash scripting)?
The csv looks like this, but with hundreds of rows. All a_i's should be replaced with their counterpart b_i in the next column over.
a_1 b_1
a_2 b_2
a_3 b_3
Something to note: All the strings actually contain underscores, just in case this affects something. I've tried wrapping the variable name in braces a la ${var}, but it still doesn't work.
I appreciate the solutions, but I'm also curious to know why the above doesn't work. (Also, I would vote everyone up, but I lack the reputation to do so. However, know that I appreciate and am learning a lot from your answers!)
If you are going to process a lot of data and your patterns can contain special characters, I would consider using Perl, especially if you are going to have a lot of pairs in findreplace.csv. You can use the following script as a filter or for in-place modification of many files. As a side effect, it loads the replacements and builds the Aho-Corasick automaton only once per invocation, which makes this solution pretty efficient (O(M+N) instead of O(M*N) in your solution).
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $in_place = ( @ARGV and $ARGV[0] =~ /^-i(.*)/ )
? do {
shift;
my $backup_extension = $1;
my $backup_name = $backup_extension =~ /\*/
? sub { ( my $fn = $backup_extension ) =~ s/\*/$_[0]/; $fn }
: sub { shift . $backup_extension };
my $oldargv = '-';
sub {
if ( $ARGV ne $oldargv ) {
rename( $ARGV, $backup_name->($ARGV) );
open( ARGVOUT, '>', $ARGV );
select(ARGVOUT);
$oldargv = $ARGV;
}
};
}
: sub { };
die "$0: File with replacements required." unless #ARGV;
my ( $re, %replace );
do {
my $filename = shift;
open my $fh, '<', $filename;
%replace = map { chomp; split ',', $_, 2 } <$fh>;
close $fh;
$re = join '|', map quotemeta, keys %replace;
$re = qr/($re)/;
};
while (<>) {
$in_place->();
s/$re/$replace{$1}/g;
}
continue {print}
Usage:
./replace.pl replace.csv <file.in >file.out
as well as
./replace.pl replace.csv file.in >file.out
or in-place
./replace.pl -i replace.csv file1.csv file2.csv file3.csv
or with backup
./replace.pl -i.orig replace.csv file1.csv file2.csv file3.csv
or with a backup name containing a placeholder
./replace.pl -ithere.is.\*.original replace.csv file1.csv file2.csv file3.csv
You should convert your CSV file to a sed.script with the following command:
cat replace.csv | awk -F, '{print "s/" $1 "/" $2 "/g";}' > sed.script
And then you will be able to do a one pass replacement:
sed -i -f sed.script longfilename.txt
This will be a faster implementation of what you wanna do.
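For instance, assuming replace.csv really is comma separated (e.g. a_1,b_1 per row), the generated sed.script would look something like this:
s/a_1/b_1/g
s/a_2/b_2/g
s/a_3/b_3/g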
BTW, sorry, but I do not understand what is wrong with your script, which should work unless your CSV file has more than 2 columns.

How to replace nth field in a csv file with mapped value from another file?

I have a csv file in the following format:
23:56:00,5,1,7,99,100,101
23:56:30,5,1,7,98,199,191
23:57:00,6,1,6,99,99,98
23:57:30,5,2,6,97,99,199
...
And a map file in the following format:
1:10
2:12
3:30
4:aa
5:16
6:11
7:bb
What I'm trying to accomplish is to replace the fields in columns 2,3 and 4 in the first csv files with the values they map to in the map file.
For example in the above case the final output I want is this:
23:56:00,16,10,bb,99,100,101
23:56:30,16,10,bb,98,199,191
23:57:00,11,10,11,99,99,98
23:57:30,16,12,11,97,99,199
What would be the best way to do this? I was trying to figure out a way using awk/sed, but I'm not sure how to access multiple files inside awk, or if that is even the best way to do this. There will be a lot of repetitions since it's a large file, so I don't think checking for a mapping each time is the right way to do this.
Is there a way to store the map in to a hash table inside the shell script, and then replace using the hash mapping?
Try with:
awk '
BEGIN { FS = OFS = "," }
FNR == NR {
split($0, f, /:/)
map[f[1]] = f[2]
next
}
{
for (i=2; i<=4; i++) {
if ($i in map) { $i = map[$i] }
}
}
{ print }
' mapfile csvfile
It reads the map file first and saves data in an associative array that is compared with fields 2, 3 and 4 from the csv file. The result yields:
23:56:00,16,10,bb,99,100,101
23:56:30,16,10,bb,98,199,191
23:57:00,11,10,11,99,99,98
23:57:30,16,12,11,97,99,199
One pure Bash possibility (with Bash version ≥ 4):
Slurp the map file in an associative array and process your csv file:
#!/bin/bash
declare -A map=()
while IFS=: read -r k v; do
[[ -z "$k$v" ]] && continue # ignore empty lines
map[$k]=$v
done < mapfile.txt
IFS=,
while read -r -a ary; do
[[ -z "${ary[#]}" ]] && continue # ignore empty lines
ary[1]=${map[${ary[1]}]}
ary[2]=${map[${ary[2]}]}
ary[3]=${map[${ary[3]}]}
echo "${ary[*]}"
done < csvfile.txt
If the keys in your map file are non-negative integers, you don't need associative arrays, and just replace the line declare -A map=() with map=().
It might not be the most efficient since Bash is not the fastest to process data, but it works well!
Btw, there is no error checking whatsoever, so be sure you apply this script to well-formatted files.
On your example, this yields:
23:56:00,16,10,bb,99,100,101
23:56:30,16,10,bb,98,199,191
23:57:00,11,10,11,99,99,98
23:57:30,16,12,11,97,99,199
Perl solution. Hashes exist in recent versions of bash, but I prefer a real programming language when working with them.
#!/usr/bin/perl
use warnings;
use strict;
open my $MAP, '<', '1.map' or die $!;
my %map;
while (<$MAP>) {
chomp;
my ($key, $value) = split /:/;
$map{$key} = $value;
}
open my $CSV, '<', '1.csv' or die $!;
while (<$CSV>) {
my @fields = split /,/;
s/(.*)/$map{$1}/ for @fields[1, 2, 3];
print join ',' => @fields;
}
Another awk
awk -F",|:" 'FNR==NR {a[$1]=$2;next} {print $1":"$2":"$3,a[$4],a[$5],a[$6],$7,$8,$9}' OFS=, map csv
23:56:00,16,10,bb,99,100,101
23:56:30,16,10,bb,98,199,191
23:57:00,11,10,11,99,99,98
23:57:30,16,12,11,97,99,199

Need to pick Latest File From a Dir Using Shell Script

I am new to Shell Script and I got a requirement to pick the latest files from a dir using Shell script
Directory Name : FTPDIR
Files in this dir will be of the form:
APC5502015VP072020121826.csv
APC5502015VP082020122314.csv
APC5502015VP092020121451.csv
CBC5502015VP092020122045.csv
CBC5502015VP102020122045.csv
S5502015VP072020121620.csv
S5502015VP072020122314.csv
S5502015VP092020122045.csv
Note: (Need to pick one latest file from each group). Below is the output which I need to get after executing the shell script:
APC5502015VP092020121451.csv
CBC5502015VP102020122045.csv
S5502015VP092020122045.csv
Ex: In the latest file APC5502015VP092020121451.csv, the number 092020121451 is the date part in the format MMDDYYYYHHMM, and the string part is APC5502015VP (length not fixed in the string part).
I need to pick those three files from the dir using shell script
Can you help me to resolve this?
It's going to be really problematic to do this safely in just bash. As Jonathan mentioned, "special" characters like spaces or newlines may bung up your script.
If we can assume that there won't be any of those, then we can do most of the job in bash, without involving other tools.
# Make an associative array to record types, in the second loop...
declare -A a
for file in *.csv; do
# First, we convert the filenames into something that can be sorted.
# The next three lines account for your "unknown length" in the first part
# of the filename. We assume the date+time is the 12 chars before ".csv".
new="$(rev <<<"$file")"
new="${new:4:12}"
new="$(rev <<<"$new")"
new="${new:4:4}${new:0:2}${new:2:2}${new:8:4}"
len=$(( ${#file} - 16 ))
echo "$new ${file:0:$len} $file"
done | sort -r | while read date type file; do
# Next, we print only the first of each "type"...
if [[ ${a[$type]} -eq 0 ]]; then
a[$type]=1
echo "$file"
fi
# And stop once we have collected three types.
if [[ ${#a[*]} -ge 3 ]]; then
break
fi
done
As I say, this doesn't handle newlines in filenames.
Note also that this uses rev and sort, which are not built in to bash. The rev parts could be done internally, using more code, which might make them execute faster, but you'd only see a difference in very extreme cases. There's not much we can do about sort, since there isn't a built-in within bash.
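Just to illustrate the "done internally" remark, a string can be reversed in pure bash with a small helper like the (hypothetical) one below, at the cost of a loop per filename:
# Hypothetical pure-bash replacement for rev(1), built character by character.
reverse() {
    local s=$1 out= i
    for (( i = ${#s} - 1; i >= 0; i-- )); do
        out+=${s:i:1}
    done
    printf '%s\n' "$out"
}
reverse "APC5502015VP092020121451.csv"   # vsc.154121020290PV5102055CPA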
This Perl script works on the given data. No doubt it could be improved.
#!/usr/bin/env perl
use strict;
use warnings;
my %bases;
while (<>)
{
chomp;
my $name = $_;
my($prefix, $mmdd, $yyyy, $hhmm) = ($name =~ m/(.*)(\d{4})(\d{4})(\d{4})\.csv/);
#print "$name = $prefix $yyyy $mmdd $hhmm\n";
my $stamp = "$yyyy$mmdd$hhmm";
if (!exists($bases{$prefix}) || ($stamp > $bases{$prefix}->{stamp}))
{
$bases{$prefix} = { name => $name, stamp => $stamp };
}
}
foreach my $prefix (sort keys %bases)
{
print "$bases{$prefix}->{name}\n";
}
Output:
APC5502015VP092020121451.csv
CBC5502015VP102020122045.csv
S5502015VP092020122045.csv
this is the awk solution:
cd FTPDIR
ls -1|awk -F"VP" '{split($2,a,".");if(a[1]>b[$1]){b[$1]=$2}}END{for(i in b)print i"VP"b[i]}'
Tested below:
> cat temp
APC5502015VP072020121826.csv
APC5502015VP082020122314.csv
APC5502015VP092020121451.csv
CBC5502015VP092020122045.csv
CBC5502015VP102020122045.csv
S5502015VP072020121620.csv
S5502015VP072020122314.csv
S5502015VP092020122045.csv
> awk -F"VP" '{split($2,a,".");if(a[1]>b[$1]){b[$1]=$2}}END{for(i in b)print i"VP"b[i]}' temp
CBC5502015VP102020122045.csv
S5502015VP092020122045.csv
APC5502015VP092020121451.csv

Sort files by basename

After a find, I'd like to sort the output by the basename (the number of directories is unknown). I know this can be done by splitting the basename from the dirname and sorting that, but I'm specifically looking for something where it's not necessary to modify the data before the sort. Something like sort --field-separator='/' -k '-1'.
For this task, I'd turn to perl and the use of a custom sort function. Save the perl code below as basename_sort.pl, chmod it 0755, then you can execute a command such as you've requested, as:
find | grep "\.php" | ./basename_sort.pl
Of course, you'll want to move that utility somewhere if you're doing it very often. Better yet, I'd also recommend wrapping a function around it within your .bashrc file. (staying on topic, sh code for that not included)
#!/usr/bin/perl
use strict;
my @lines = <STDIN>;
@lines = sort basename_sort @lines;
foreach( @lines ) {
print $_;
}
sub basename_sort() {
my @data1 = split('/', $a);
my @data2 = split('/', $b);
my $name1 = $data1[@data1 - 1];
my $name2 = $data2[@data2 - 1];
return lc($name1) cmp lc($name2);
}
This can be written shorter.
find | perl -e 'print sort{($p=$a)=~s!.*/!!;($q=$b)=~s!.*/!!;$p cmp$q}<>'
I ended up with a solution of simply moving the basename to the start of the string, sorting, and moving it back. Not really what I was hoping for, but it works even with weirdo file names.
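For reference, that decorate/sort/undecorate idea can be sketched in plain shell along these lines (assuming the file names contain no tabs or newlines; the *.php filter is just an example):
find . -name '*.php' |
  awk -F/ '{ print $NF "\t" $0 }' |  # decorate: basename <TAB> full path
  sort -f -t $'\t' -k1,1 |           # sort (case-insensitively) on the basename
  cut -f2-                           # undecorate: keep only the full path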
