Substitution only on last matching line (Perl one-liner) - macOS

I have multiple files of the form
version 'aaa'
other 'bbb'
another 'ccc'
version 'ddd'
onemore 'eee'
Some have one version, others have multiple; same with the other keys, but the values never repeat. As part of a bigger bash function, I'm using a Perl one-liner to modify values:
modify_value() {
key_to_modify="$1"
new_value="$2"
perl -i'' -pe "s|^(\s*)${key_to_modify} .*|\1${key_to_modify} ${new_value}|" "${file}"
}
The indentation on the lines varies and is unpredictable, but should be respected (hence the need for ^(\s*)). This function works well, up to a point. I can do
modify_value "onemore" "fff"
And it will be correctly replaced in the text file. However, it breaks down where I have multiple keys with the same name (such as the aforementioned version), as the change will be made in all of them. In my particular case, I always want the modification made to the last occurrence.
Since values are never repeated, what I have so far is:
modify_value() {
key_to_modify="$1"
new_value="$2"
last_key=$(cat "${file}" | grep "^\s*${key_to_modify}" | tail -1 | perl -pe 's/^\s*//')
perl -i'' -pe "s|^(\s*)${last_key}|\1${key_to_modify} ${new_value}|" "${file}"
}
This works, but is a bit inelegant. Would it be possible to leverage the perl one-liner to act only on the latest occurrence of the match, instead?

You might be tempted to use Tie::File.
# Borodin's solution with the bug fixes I mention below.
perl -MTie::File -e'
$key = shift(@ARGV);
$val = shift(@ARGV);
$file = shift(@ARGV);
tie @f, "Tie::File", $file;
for (reverse @f) { last if s/^\s*\Q$key\E\s\K.*/$val/; }
' "$1" "$2" "$file"
For small files, Tie::File will provide a solution that's slower than the alternatives and that uses more memory than the alternatives.
For large files, Tie::File will provide an abysmally slow solution to this problem, although it will use less memory than loading the entire file into memory.
You really can't do any worse than using Tie::File for this problem.
Here's an alternative:
perl -i -e'
$key = shift(@ARGV);
$val = shift(@ARGV);
my @f = reverse(<>);
for (@f) { last if s/^\s*\Q$key\E\s\K.*/$val/; }
print reverse(@f);
' "$1" "$2" "$file"
You could even avoid the double reversing by having the substitution operator find the last match.
# 5.14+
perl -0777 -i -e'
$key = shift(@ARGV);
$val = shift(@ARGV);
print <> =~ s/\A.*^\s*\Q$key\E\s\K[^\n]*/$val/smr;
' "$1" "$2" "$file"
or
perl -0777 -i -e'
$key = shift(@ARGV);
$val = shift(@ARGV);
$_ = <>;
s/\A.*^\s*\Q$key\E\s\K[^\n]*/$val/sm;
print;
' "$1" "$2" "$file"
or
perl -0777 -i -pe'
BEGIN {
$key = shift(@ARGV);
$val = shift(@ARGV);
}
s/\A.*^\s*\Q$key\E\s\K[^\n]*/$val/sm;
' "$1" "$2" "$file"
If memory is an issue, reverse the input using File::ReadBackwards (or a similarly efficient tool), change the first match, then reverse the output using File::ReadBackwards.
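For example, here's a rough, untested sketch of that two-pass approach (the .tmp file name and the minimal error handling are just placeholders):
use strict;
use warnings;
use File::ReadBackwards;

my ($key, $val, $file) = @ARGV;

# Pass 1: stream the file in reverse into a temp file, substituting only the
# first match seen (i.e. the last occurrence in the original file).
my $bw = File::ReadBackwards->new($file)
    or die("Can't read \"$file\": $!\n");
open(my $tmp, '>', "$file.tmp")
    or die("Can't create \"$file.tmp\": $!\n");
my $done = 0;
while (defined( my $line = $bw->readline() )) {
    $done ||= $line =~ s/^\s*\Q$key\E\s\K.*/$val/;
    print $tmp $line;
}
close($tmp);

# Pass 2: stream the temp file in reverse back over the original, restoring
# the line order. Only one line is held in memory at a time.
my $bw2 = File::ReadBackwards->new("$file.tmp")
    or die("Can't read \"$file.tmp\": $!\n");
open(my $out, '>', $file)
    or die("Can't write \"$file\": $!\n");
print $out $_ while defined( $_ = $bw2->readline() );
close($out);
unlink("$file.tmp");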
These solutions also fix the improper interpolation of $key_to_modify and $new_value into the Perl program (by passing the values as args).
These solutions also fix the improper interpolation of $key_to_modify into the regex (by using \Q).

I suggest you use Tie::File, which lets you access a file as an array of lines. Any modifications made to the array are reflected in the file. It has been a core module since Perl 5.8, so it shouldn't need to be installed.
This one-liner works by checking each line of the file from the end to the beginning, and stopping as soon as a match is found. It looks okay, but I'm not in a position to test it at present.
perl -MTie::File -e"tie @f,'Tie::File',\"${file}\"; s/^\s*${key_to_modify}\s\K.*/${new_value}/ and last for reverse @f"

Related

Unscramble words Challenge - improve my bash solution

There is a Capture the Flag challenge.
I have two files; one with scrambled text like this, with about 550 entries:
dnaoyt
cinuertdso
bda
haey
tolpap
...
The second file is a dictionary with about 9,000 entries:
radar
ccd
gcc
fcc
historical
...
The goal is to find the right, unscrambled version of the word, which is contained in the dictionary file.
My approach is to sort the characters of the first word from the first file, then check whether the first word from the second file has the same length. If so, sort that one too and compare them.
This is my fully functional bash script, but it is very slow.
#!/bin/bash
while IFS="" read -r p || [ -n "$p" ]
do
var=0
ro=$(echo $p | perl -F -lane 'print sort @F')
len_ro=${#ro}
while IFS="" read -r o || [ -n "$o" ]
do
ro2=$(echo $o | perl -F -lane 'print sort @F')
len_ro2=${#ro2}
let "var+=1"
if [ $len_ro == $len_ro2 ]; then
if [ $ro == $ro2 ]; then
echo $o >> new.txt
echo $var >> whichline.txt
fi
fi
done < dictionary.txt
done < scrambled-words.txt
I have also tried converting all characters to ASCII integers and summing each word, but while comparing I realized that different character patterns may produce the same sum.
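Here's a tiny illustration of that pitfall (a toy example of my own, not from the challenge files): different letter patterns can give the same sum, so the sum alone can't identify a word.
perl -E 'for my $w ("ad", "bc") { my $sum = 0; $sum += ord($_) for split //, $w; say "$w: $sum" }'
Both lines print 197 (97+100 and 98+99), even though "ad" and "bc" are different patterns.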
[edit]
For the record:
- no anagrams are contained in the dictionary
- to get the flag, you need to export the unscrambled words as one blob and make a SHA hash out of it (that's the flag)
- link to the CTF for whoever wanted the files: https://challenges.reply.com/tamtamy/user/login.action
You're better off creating a lookup dictionary (keyed by the sorted word) from the dictionary file.
Your loop body is executed 550 * 9,000 = 4,950,000 times (O(N*M)).
The solution I propose executes two loops of at most 9,000 passes each (O(N+M)).
Bonus: It finds all possible solutions at no cost.
#!/usr/bin/perl
use strict;
use warnings qw( all );
use feature qw( say );
my $dict_qfn = "dictionary.txt";
my $scrambled_qfn = "scrambled-words.txt";
sub key { join "", sort split //, $_[0] }
my %dict;
{
open(my $fh, "<", $dict_qfn)
or die("Can't open \"$dict_qfn\": $!\n");
while (<$fh>) {
chomp;
push @{ $dict{key($_)} }, $_;
}
}
{
open(my $fh, "<", $scrambled_qfn)
or die("Can't open \"$scrambled_qfn\": $!\n");
while (<$fh>) {
chomp;
my $matches = $dict{key($_)};
say "$_ matches @$matches" if $matches;
}
}
I wouldn't be surprised if this only takes one millionth of the time of your solution for the sizes you provided (and it scales much better than yours if you were to increase the sizes).
I would do something like this with gawk
gawk '
NR == FNR {
dict[csort()] = $0
next
}
{
print dict[csort()]
}
function csort( chars, sorted) {
split($0, chars, "")
asort(chars)
for (i = 1; i in chars; i++)
sorted = sorted chars[i]
return sorted
}' dictionary.txt scrambled-words.txt
Here's a Perl-free solution I came up with using sort and join:
sort_letters() {
# Splits each letter onto a line, sorts the letters, then joins them
# e.g. "hello" becomes "ehllo"
echo "${1}" | fold -b1 | sort | tr -d '\n'
}
# For each input file...
for input in "dict.txt" "words.txt"; do
# Convert each line to [sorted] [original]
# then sort and save the results with a .sorted extension
while read -r original; do
sorted=$(sort_letters "${original}")
echo "${sorted} ${original}"
done < "${input}" | sort > "${input}.sorted"
done
# Join the two files on the [sorted] word
# outputting the scrambled and unscrambled words
join -j 1 -o 1.2,2.2 "words.txt.sorted" "dict.txt.sorted"
I tried something very similar, but a bit different.
#!/bin/bash
exec 3<scrambled-words.txt
while read -r line <&3; do
printf "%s" ${line} | perl -F -lane 'print sort @F'
done>scrambled-words_sorted.txt
exec 3>&-
exec 3<dictionary.txt
while read -r line <&3; do
printf "%s" ${line} | perl -F -lane 'print sort @F'
done>dictionary_sorted.txt
exec 3>&-
printf "" > whichline.txt
exec 3<scrambled-words_sorted.txt
while read -r line <&3; do
counter="$((++counter))"
grep -n -e "^${line}$" dictionary_sorted.txt | cut -d ':' -f 1 | tr -d '\n' >>whichline.txt
printf "\n" >>whichline.txt
done
exec 3>&-
As you can see, I don't create a new.txt file; instead I only create whichline.txt, with a blank line where the word doesn't match. You can easily paste the two together to create new.txt.
The logic behind the script is nearly the same as yours, except that I call perl fewer times and save two support files.
I think (but I am not sure) that creating them and cycling over only one file will be better than ~5 million calls to perl; this way it is called "only" ~10k times.
Finally, I decided to use grep because it's (maybe) the fastest regex matcher, and since it searches for the entire line, the length check is intrinsic to the regex.
Please note that what @benjamin-w said still applies; in that case grep will misbehave, and I did not handle it!
I hope this helps [:

bash: transform scaffold fasta

I have a fasta file with the following sequences:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
And I want to split all sequences in the file that contain multiple Ns, at the site where they occur, making two sequences out of each.
Expected solution:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGC
>contig1
AAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
my (inelegant) approach would be this:
perl -pe 's/[N]+/\*/g' $file | perl -pe 's/\*/\n>contig1\n/g'
Of course, that also replaces the Ns in the sequence headers and creates headers without a sequence. As a plus, it would be nice to number the new 'contigs' from 1 to x in case there are multiple sequences with Ns.
What do you suggest?
I'd suggest using split instead of trying to get a regex just right, and a script instead of a brittle and crammed "one"-liner.
use warnings;
use strict;
use feature 'say';
my $file = shift @ARGV;
die "Usage: $0 filename\n" if !$file; # also check submitted $file
my $content = do { # or: my $content = Path::Tiny::path($file)->slurp;
local $/;
open my $fh, '<', $file or die "Can't open $file: $!";
<$fh>;
};
my @f = grep { /\S/ } split /(?<!>)NN+/, $content;
say shift @f;
my $cnt;
for (@f) {
say "\n>contig", (++$cnt), ":\n$_";
}
This slurps the file into $content, since NN+ can span multiple lines; the Path::Tiny module can make that cleaner. The first element of the resulting array needs no >contig header, so it is shifted off and printed as-is.
The negative lookbehind (?<!...) makes the regex in split's separator pattern match NN+ only when not preceded by >, thus protecting (excluding) header lines that may start with that. If headers may contain consecutive Ns that are not right after >, you need to refine this.
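For reference, it could be run like this (the script and output file names are just placeholders):
perl split_fasta.pl file.fasta > file_split.fasta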
I expanded your Perl one-liner a bit:
cat file.fasta | \
perl -pe 's/\n//g unless /^>/; s/>/\n>/g;' | \
perl -pe 's/N+(?{$n++})/\n>contig${n}\n/g unless /^>/'
The first part removes the newlines between bases; the second part replaces runs of 'N'.

Using sed on text files with a csv

I've been trying to do bulk find and replace on two text files using a csv. I've seen the questions that SO suggests, and none seem to answer my question.
I've created two variables for the two text files I want to modify. The csv has two columns and hundreds of rows. The first column contains strings (none have whitespaces) already in the text file that need to be replaced with the corresponding strings in same row in the second column.
As a test, I tried the script
#!/bin/bash
test1='long_file_name.txt'
find='string1'
replace='string2'
sed -e "s/$find/$replace/g" $test1 > $test1.tmp && mv $test1.tmp $test1
This was successful, except that I need to do it once for every row in the csv, using the values given by the csv in each row. My hunch is that my while loop was used wrongly, but I can't find the error. When I execute the script below, I get the command line prompt, which makes me think that something has happened. When I check the text files, nothing's changed.
The two text files, this script, and the csv are all in the same folder (it's also been my working directory when I do this).
#!/bin/bash
textfile1='long_file_name1.txt'
textfile2='long_file_name2.txt'
while IFS=, read f1 f2
do
sed -e "s/$f1/$f2/g" $textfile1 > $textfile1.tmp && \
mv $textfile1.tmp $textfile1
sed -e "s/$f1/$f2/g" $textfile2 > $textfile2.tmp && \
mv $textfile2.tmp $textfile2
done <'findreplace.csv'
It seems to me that this code should do what I want it to do (but doesn't); perhaps I'm misunderstanding something fundamental (I'm new to bash scripting)?
The csv looks like this, but with hundreds of rows. All a_i's should be replaced with their counterpart b_i in the next column over.
a_1 b_1
a_2 b_2
a_3 b_3
Something to note: All the strings actually contain underscores, just in case this affects something. I've tried wrapping the variable name in braces a la ${var}, but it still doesn't work.
I appreciate the solutions, but I'm also curious to know why the above doesn't work. (Also, I would vote everyone up, but I lack the reputation to do so. However, know that I appreciate and am learning a lot from your answers!)
If you are going to process a lot of data and your patterns can contain special characters, I would consider using Perl, especially if you are going to have a lot of pairs in findreplace.csv. You can use the following script as a filter or for in-place modification of a lot of files. As a side effect, it loads the replacements and builds the Aho-Corasick automaton only once per invocation, which makes this solution pretty efficient (O(M+N) instead of O(M*N) in your solution).
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $in_place = ( @ARGV and $ARGV[0] =~ /^-i(.*)/ )
? do {
shift;
my $backup_extension = $1;
my $backup_name = $backup_extension =~ /\*/
? sub { ( my $fn = $backup_extension ) =~ s/\*/$_[0]/; $fn }
: sub { shift . $backup_extension };
my $oldargv = '-';
sub {
if ( $ARGV ne $oldargv ) {
rename( $ARGV, $backup_name->($ARGV) );
open( ARGVOUT, '>', $ARGV );
select(ARGVOUT);
$oldargv = $ARGV;
}
};
}
: sub { };
die "$0: File with replacements required." unless @ARGV;
my ( $re, %replace );
do {
my $filename = shift;
open my $fh, '<', $filename;
%replace = map { chomp; split ',', $_, 2 } <$fh>;
close $fh;
$re = join '|', map quotemeta, keys %replace;
$re = qr/($re)/;
};
while (<>) {
$in_place->();
s/$re/$replace{$1}/g;
}
continue {print}
Usage:
./replace.pl replace.csv <file.in >file.out
as well as
./replace.pl replace.csv file.in >file.out
or in-place
./replace.pl -i replace.csv file1.csv file2.csv file3.csv
or with backup
./replace.pl -i.orig replace.csv file1.csv file2.csv file3.csv
or with a backup with a placeholder
./replace.pl -ithere.is.\*.original replace.csv file1.csv file2.csv file3.csv
You should convert your CSV file to a sed.script with the following command:
cat replace.csv | awk -F, '{print "s/" $1 "/" $2 "/g";}' > sed.script
And then you will be able to do a one pass replacement:
sed -i -f sed.script longfilename.txt
This will be a faster implementation of what you want to do.
BTW, sorry, but I do not understand what is wrong with your script; it should work unless your CSV file has more than 2 columns.

Perl print single quotes from command line script

The following
echo text | perl -lnE 'say "word: $_\t$_"'
prints
word: text text
I need
word: 'text' 'text'
Tried:
echo text | perl -lnE 'say "word: \'$_\' \'$_\'";' # didn't work
echo text | perl -lnE 'say "word: '$_' '$_'";' # neither did this
How to correctly escape the single quotes for bash?
Edit:
I want to prepare a shell script with a couple of mv lines (to check them before actually renaming the files). E.g., I tried to solve the following:
find . -type f -print | \
perl \
-MText::Unaccent::PurePerl=unac_string \
-MUnicode::Normalize=NFC -CASD \
-lanE 'BEGIN{$q=chr(39)}$o=$_;$_=unac_string(NFC($_));s/[{}()\[\]\s\|]+/_/g;say "mv $q$o$q $_"' >do_rename
e.g. from the filenames like:
Somé filénamé ČŽ (1980) |Full |Movie| Streaming [360p] some.mp4
I want to get the following output in the file do_rename:
mv 'Somé filénamé ČŽ (1980) |Full |Movie| Streaming [360p] some.mp4' Some_filename_CZ_1980_Full_Movie_Streaming_360p_some.mp4
and after manual inspection, run:
bash do_rename
for running the actual rename...
You can use ASCII code 39 for ' to avoid escape hell:
echo text | perl -lnE 'BEGIN{ $q=chr(39) } say "word: $q$_$q\t$q$_$q"'
You can use:
echo text | perl -lnE "say \"word: '\$_'\t'\$_'\""
word: 'text' 'text'
Bash allows you to include an escaped double quote inside a double-quoted string, but the same doesn't apply to single quotes. While doing so, we need to escape $ so that Bash doesn't interpolate it.
OK, based on the statement of the problem you're having, my suggestion would be: don't pipe find to perl; that's just asking for all kinds of annoyance.
I'm not entirely familiar with the modules, but would suggest you try something like this:
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Text::Unaccent::PurePerl qw ( unac_string );
use Unicode::Normalize qw ( NFC );
use Getopt::Std;
use File::Copy qw ( move );
use Encode qw(decode_utf8);
my %opts;
#x to execute, p to specify a path.
#p mandatory.
getopts('xp:',\%opts);
#this sub is called for each file, by the find function.
#$File::Find::name is the full path to the file.
#$_ is just the filename.
sub rename_unicode_files {
#skip if it's not a file.
return unless -f $File::Find::name;
#convert name with functions from your example.
my $newname = unac_string(NFC(decode_utf8($File::Find::name)));
$newname =~ s/[{}()\[\]\s\|]+/_/g;
#could apply other transforms here, such as regular expressions.
#if the two names are different, consider moving.
unless ( $newname eq $File::Find::name ) {
print "Would rename: $File::Find::name to $newname\n";
#actually do it, if '-x' is specified.
if ( $opts{x} ) { move ( $File::Find::name, $newname ); };
}
}
#require -p <pathname> or otherwise print how to use.
unless ( -d $opts{p} ) {
print "Usage: $0 -p <pathname> [-x]\n";
exit;
}
#trigger find with callback to subroutine, over the '-p <path>'.
#no_chdir so $File::Find::name can be used directly in the file test and move.
find ( { wanted => \&rename_unicode_files, no_chdir => 1 }, $opts{p} );
Extend with something like Getopt::Std to check whether you've specified an option - so if you run it normally you get 'this is what I would do', and if you specify a particular flag it actually does it.
And either use the Perl builtin rename or the move available from File::Copy.
This will neatly avoid a lot of the escaping and interpolating problems you're having, and I think leave you with generally more readable and useful code.
Edit: Given a comment suggesting that the above is 'too long', how about:
#!/usr/bin/perl
use File::Find; use Text::Unaccent::PurePerl qw ( unac_string ); use Unicode::Normalize qw ( NFC ); find( sub { return unless -f; print "mv \'$File::Find::name\' \'", unac_string( NFC($File::Find::name) ) . "\'\n"; }, "." );
Still not convinced of the value of the approach. Even if it is only run occasionally, that's even more reason to make it as clear as possible.
There is a trivial solution using the zsh shell and SQL-like (at least PostgreSQL and Oracle) quoting style:
$ setopt rc_quotes
$ echo text | perl -lnE 'say "word: ''$_''\t''$_''"'
word: 'text' 'text'
To quote a ' you simply double it and use '' in this mode.

Need to pick Latest File From a Dir Using Shell Script

I am new to shell scripting, and I have a requirement to pick the latest files from a directory using a shell script.
Directory name: FTPDIR
The files in this directory are of the form:
APC5502015VP072020121826.csv
APC5502015VP082020122314.csv
APC5502015VP092020121451.csv
CBC5502015VP092020122045.csv
CBC5502015VP102020122045.csv
S5502015VP072020121620.csv
S5502015VP072020122314.csv
S5502015VP092020122045.csv
Note: I need to pick the latest file from each group. Below is the output I need to get after executing the shell script:
APC5502015VP092020121451.csv
CBC5502015VP102020122045.csv
S5502015VP092020122045.csv
Ex: In the latest file APC5502015VP092020121451.csv, the number 092020121451 is the date part, in the format MMDDYYYYHHMM, and the string part is APC5502015VP (the length of the string part is not fixed).
I need to pick those three files from the directory using a shell script.
Can you help me resolve this?
It's going to be really problematic to do this safely in just bash. As Jonathan mentioned, "special" characters like spaces or newlines may bung up your script.
If we can assume that there won't be any of those, then we can do most of the job in bash, without involving other tools.
# Make an associative array to record types, in the second loop...
declare -A a
for file in *.csv; do
# First, we convert the filenames into something that can be sorted.
# The next three lines account for your "unknown length" in the first part
# of the filename. We assume the date+time is the 12 chars before ".csv".
new="$(rev <<<"$file")"
new="${new:4:12}"
new="$(rev <<<"$new")"
new="${new:4:4}${new:0:2}${new:2:2}${new:8:4}"
len=$(( ${#file} - 16 ))
echo "$new ${file:0:$len} $file"
done | sort | while read date type file; do
# Next, we print only the first of each "type"...
if [[ ${a[$type]} -eq 0 ]]; then
a[$type]=1
echo "$file"
fi
# And stop once we have collected three types.
if [[ ${#a[*]} -ge 3 ]]; then
break
fi
done
As I say, this doesn't handle newlines in filenames.
Note also that this uses rev and sort, which are not built in to bash. The rev parts could be done internally, using more code, which might make them execute faster, but you'd only see a difference in very extreme cases. There's not much we can do about sort, since there isn't a built-in within bash.
This Perl script works on the given data. No doubt it could be improved.
#!/usr/bin/env perl
use strict;
use warnings;
my %bases;
while (<>)
{
chomp;
my $name = $_;
my($prefix, $mmdd, $yyyy, $hhmm) = ($name =~ m/(.*)(\d{4})(\d{4})(\d{4})\.csv/);
#print "$name = $prefix $yyyy $mmdd $hhmm\n";
my $stamp = "$yyyy$mmdd$hhmm";
if (!exists($bases{$prefix}) || ($stamp > $bases{$prefix}->{stamp}))
{
$bases{$prefix} = { name => $name, stamp => $stamp };
}
}
foreach my $prefix (sort keys %bases)
{
print "$bases{$prefix}->{name}\n";
}
Output:
APC5502015VP092020121451.csv
CBC5502015VP102020122045.csv
S5502015VP092020122045.csv
Here is the awk solution:
cd FTPDIR
ls -1|awk -F"VP" '{split($2,a,".");if(a[1]>b[$1]){b[$1]=$2}}END{for(i in b)print i"VP"b[i]}'
Tested below:
> cat temp
APC5502015VP072020121826.csv
APC5502015VP082020122314.csv
APC5502015VP092020121451.csv
CBC5502015VP092020122045.csv
CBC5502015VP102020122045.csv
S5502015VP072020121620.csv
S5502015VP072020122314.csv
S5502015VP092020122045.csv
> awk -F"VP" '{split($2,a,".");if(a[1]>b[$1]){b[$1]=$2}}END{for(i in b)print i"VP"b[i]}' temp
CBC5502015VP102020122045.csv
S5502015VP092020122045.csv
APC5502015VP092020121451.csv
