Efficiency of splitting very long transliteration in Perl - performance

I have this very long transliteration:
$text =~ tr/áàăâǎåǻäǟãȧǡąāȁȃɑʙƀɓƃćĉčċçȼƈɕʗďđðɖɗƌȡéèĕêěëėȩęēȅȇɇɛ/aaaaaaaaaaaaaaaaabbbbcccccccccdddddddeeeee/;
# Etc. (About 400 chars)
I want to split it into several transliterations since the resulting code would be easier to maintain:
$text =~ tr/áàăâǎåǻäǟãȧǡąāȁȃɑ/aaaaaaaaaaaaaaaaa/;
$text =~ tr/ʙƀɓƃ/bbbb/;
$text =~ tr/ćĉčċçȼƈɕʗ/ccccccccc/;
# Etc.
I believe that is going to slow things down, but I'd like to know for sure. This process runs about 1000 times per second on a pretty busy server.
Thanks.

You could build a transliterator:
my %translit = (
'áàăâǎåǻäǟãȧǡąāȁȃɑ' => 'a',
'ʙƀɓƃ' => 'b',
'ćĉčċçȼƈɕʗ' => 'c',
);
my $pat = '';
my $repl = '';
for (keys(%translit)) {
$pat .= $_;
$repl .= $translit{$_} x length($_);
}
my $tr1 = eval "sub { tr/\Q$pat\E/\Q$repl\E/ }" or die $@;
-or-
my $tr2 = eval "sub { \$_[0] =~ tr/\Q$pat\E/\Q$repl\E/ }" or die $@;
Then use it like this:
$tr1->() for $str;
-or-
$tr2->($str);
Of course, you could always use Text::Unidecode.

I would expect the second solution, with several tr operations, to be slower, because each operation scans the whole of $text again, including characters that have already been substituted.
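As a quick sanity check of that reasoning, here is a minimal sketch (the sample string is made up, and the script assumes it is saved as UTF-8) showing that the split form produces the same result as the single tr. Each tr walks the entire string, so three of them mean three passes.

```perl
use strict;
use warnings;
use utf8;    # the source file itself contains non-ASCII literals

# Hypothetical sample input, not taken from the question.
my $one   = "ćàƀ plain text";
my $multi = $one;

# tr/// returns the number of characters it transliterated.
my $n_one = ($one =~ tr/àƀć/abc/);

# Same mapping split into three operations: three full passes over the string.
my $n_multi = 0;
$n_multi += ($multi =~ tr/à/a/);
$n_multi += ($multi =~ tr/ƀ/b/);
$n_multi += ($multi =~ tr/ć/c/);

print "$one ($n_one) / $multi ($n_multi)\n";
```

This only checks equivalence of the results; it says nothing about speed.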

Here is a benchmark:
use Benchmark qw(:all);
my $str = 'áàăâǎåǻäǟãȧǡąāȁȃɑʙƀɓƃćĉčċçȼƈɕʗďđðɖɗƌȡéèĕêěëėȩęēȅȇɇɛ/aaaaaaaaaaaaaaaaabbbbcccccccccdddddddeeeee';
my $count = -2;
cmpthese($count, {
'one tr' => sub {
$str =~ tr/áàăâǎåǻäǟãȧǡąāȁȃɑʙƀɓƃćĉčċçȼƈɕʗďđðɖɗƌȡéèĕêěëėȩęēȅȇɇɛ/aaaaaaaaaaaaaaaaabbbbcccccccccdddddddeeeee/;
},
'multi tr' => sub {
$str =~ tr/áàăâǎåǻäǟãȧǡąāȁȃɑ/aaaaaaaaaaaaaaaaa/;
$str =~ tr/ʙƀɓƃ/bbbb/;
$str =~ tr/ćĉčċçȼƈɕʗ/ccccccccc/;
$str =~ tr/ďđðɖɗƌȡ/ddddddd/;
$str =~ tr/éèĕêěëėȩęēȅȇɇɛ/eeeee/;
},
});
Result:
              Rate multi tr   one tr
multi tr 1215538/s       --     -81%
one tr   6271883/s     416%       --
As we see, a single tr is about 5 times faster than multiple tr operations.

Related

How to grab an unknown pattern from a list in perl?

Is there a way to identify patterns in perl from a list? Lets say I have a file1:
123456
123abc456
123_abc_456
123_abc_d_456
abc_123_d_456
red123456blue
from this list I can clearly see that the pattern "123" and pattern "456" are common to all items in the list. So I want to store that as a variable.
Assume that the items in the list can only have characters (0-9)(a-z)(A-Z) and (_). Also assume that the pattern exists for every item in the list like in the example above. The pattern can be any length.
Doing it the hard way (brute force) is quite straightforward:
use strict;
use Data::Dumper;
my @data = map { chomp; $_ } (<DATA>);
my @candidates = (shift @data);
my %seen;
my @result;
OUT: while (@candidates) {
    my $current = shift @candidates;
    next unless $current;
    # skip candidates that are substrings of an already found pattern
    foreach my $found (@result) {
        next OUT
            if $found =~ /$current/;
    }
    foreach my $ele (@data) {
        unless ($ele =~ /$current/) {
            # no match: queue the two candidates that are one character shorter
            foreach my $n (substr($current,1), substr($current,0,-1)) {
                next if $seen{$n}++;
                push @candidates, $n;
            }
            next OUT;
        }
    }
    push @result, $current;
}
print Dumper (\@result);
__DATA__
123456
123abc456
123_abc_456
123_abc_d_456
abc_123_d_456
red123456blue
prints out:
$VAR1 = [
'456',
'123'
];
Removing the "inner loop"
foreach my $found (@result) {
    next OUT
        if $found =~ /$current/;
}
would find all matches:
$VAR1 = [
'456',
'123',
'45',
'23',
'4',
'3'
];

change bash script code to perl

In bash :
#!/bin/bash
var=$(cat ps.txt)
for i in $var ; do
echo $i
done
and ps.txt is :
356735
535687
547568537
7345673
3653468
2376958764
12345678
12345
Now I want to do the same thing in Perl. In other words, I want to know how to save the output of a command in a variable in Perl, like var=$(cat ps.txt).
Instead of using cat to get file contents into a Perl variable, you should use open and <> in "slurp mode":
open my $fh, "<", "ps.txt" or die "Failed to open ps.txt: $!";
local $/;
my $file_contents = <$fh>;
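To reproduce the loop from the bash script, the slurped string can then be split on whitespace, which is roughly what the shell's word splitting does to $var. A minimal sketch, with a literal string standing in for the contents of ps.txt:

```perl
use strict;
use warnings;

# Stand-in for the slurped file; in practice this would come from <$fh>.
my $file_contents = "356735\n535687\n547568537\n";

# split ' ' splits on any run of whitespace and ignores leading whitespace.
my @words = split ' ', $file_contents;

for my $i (@words) {
    print "$i\n";
}
```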
Here are some ways to do it:
#!/usr/bin/perl
$ifile = "ps.txt";
# capture command output
# NOTE: this puts each line in a separate array element -- the newline is _not_
# stripped
@bycat = (`cat $ifile`);
# this strips the newline from all array elements:
chomp(@bycat);
# so would this:
# NOTE: for this type of foreach, if you modify $buf, it also modifies the
# corresponding array element
foreach $buf (@bycat) {
    chomp($buf);
}
# read in all elements line-by-line
open($fin,"<$ifile") or die("unable to open '$ifile' -- $!\n");
while ($buf = <$fin>) {
    chomp($buf);
    push(@byread,$buf);
}
close($fin);
# print the arrays
# NOTE: we are passing the arrays "by-reference"
show("bycat",\@bycat);
show("byread",\@byread);
# show -- dump the array
sub show
# sym -- name of array
# ptr -- reference to array
{
    my($sym,$ptr) = @_;
    my($buf);
    foreach $buf (@$ptr) {
        printf("%s: %s\n",$sym,$buf);
    }
}
I'm not sure what this is trying to achieve, but this is my answer:
my $var = `/bin/cat $0`; # the Perl program itself ;-)
print $var;
If you need the lines, $var can be split on $/.
#! /usr/bin/perl -w
my $var = `/bin/cat $0`;
print $var;
my $n = 1;
for my $line ( split( $/, $var ) ){
print "$n: $line\n";
$n++;
}

delete lines between two patterns without deleting the pattern

I have a file like the one below:
[NAMES]
biren
bikash
dibya
[MAIL]
biren_k
bikash123
dibya008
My output should look like this:
[NAMES]
[MAIL]
I tried the code below just to remove the lines between NAMES and MAIL, but it did not work.
sed -n '/NAMES/{p; :a; N; /MAIL/ba; s/.*\n//}; p' input.txt
Can anyone help, please? I would prefer Perl code if possible.
NOTE: Besides [NAMES] and [MAIL], I have a lot of other headers in my actual file; I have shown just two here. I have to replace the contents below the headers (not all of them, only selected headers, which sit at arbitrary line numbers) with new contents, but first I need to delete the contents below them. That's why I need my output to look like this. Any suggestions, please?
You can modify sed as
$ sed '/\[NAMES\]/, /\[MAIL\]/ {/^\[/p; d}' input
[NAMES]
[MAIL]
biren_k
bikash123
dibya008
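Since the question asks for Perl, the same idea can be sketched with a flag that is switched on when a selected header is seen and switched off at the next header. This is only an illustration: the %erase hash and the inline sample data are assumptions, and in practice the lines would be read from the input file.

```perl
use strict;
use warnings;

# Sample lines from the question; split /^/ keeps the trailing newlines.
my @lines = split /^/, <<'END';
[NAMES]
biren
bikash
dibya
[MAIL]
biren_k
bikash123
dibya008
END

# Hypothetical list of headers whose contents should be dropped.
my %erase = map { $_ => 1 } qw(NAMES);

my @out;
my $skip = 0;
for my $line (@lines) {
    if ($line =~ /^\[([^\]]+)\]\s*$/) {
        # A header line: always keep it, and decide whether to skip what follows.
        $skip = exists $erase{$1} ? 1 : 0;
        push @out, $line;
        next;
    }
    push @out, $line unless $skip;
}

print @out;
```

Adding more names to %erase empties the sections under those headers as well.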
Please try this; it may help with your question:
%hashes = (
    "[NAMES]" => "<br/>kumar<br/>avi<br/><br/>\n",
    "[MAIL]" => "<br/>biren_k<br/>bikash123<br/>dibya008<br/>\n"
);
my @arr = <DATA>;
foreach my $snarr (@arr)
{
    chomp($snarr);
    # keep only the header lines, each followed by its new contents
    push(@newarr, "$snarr\n$hashes{$snarr}") if( $hashes{$snarr} );
}
print @newarr;
__DATA__
[NAMES]
biren
bikash
dibya
[MAIL]
biren_k
bikash123
dibya008
Just replace the lines between my @erase = qw[ and ]; with the HEADERS you want to empty out.
#!/usr/bin/env perl
use strict;
use warnings;
push @ARGV, 'file.txt';
# here list out the HEADERS
# whose content you want to erase
my @erase = qw[
    NAMES
    MAIL
];
my %dump;
my $header;
# build a hash from your file
while (<>) {
    if (/^\[([^\]]+)\]$/) {
        $header = $1;
        $dump{$header} = "";
        next;
    }
    $dump{$header} .= $_ if $header;
}
# replace the content
# with empty string
foreach (@erase) {
    $dump{$_} = "";
}
# now print it back to <STDOUT>
foreach (sort keys %dump) {
    print "[$_]\n$dump{$_}\n";
}
I found a solution to my problem:
my @name_var = ();
while (<STDIN>)
{
    last if ($_ =~ /^\n/ );
    push(@name_var, $_);
}
my @mail_add = ();
while (<STDIN>)
{
    last if ($_ =~ /^\n/ );
    push(@mail_add, $_);
}
open(my $var, "input.txt") || die("Input File not found");
open(my $out, ">temp.txt") || die("Temp File not created");
while($line = <$var>)
{
    # print $line;
    if( $line =~ /\[NAMES\]/)
    {
        print $out $line;
        print $out @name_var;
        # skip the old contents up to the next blank line
        while(($line = <$var>) && ($line !~ /^\n/))
        {
        }
    }
    if( $line =~ /\[MAIL\]/)
    {
        print $out $line;
        print $out @mail_add;
        # skip the old contents up to the next blank line
        while(($line = <$var>) && ($line !~ /^\n/))
        {
        }
    }
    print $out $line;
}
close($var);
close($out);
# copy the temporary file back over the input file
open($var1,">input.txt") || die("failed to open\n");
open($out1,"<temp.txt") || die("failed to open\n");
while($fl = <$out1>)
{
    print $var1 $fl;
}
close($var1);
close($out1);
Thank you all. I got the solution from Stack Overflow, PerlMonks and a few more sites related to Perl.

How to delete everything between two :'s, but not if between {}'s? [duplicate]

I have a text file like this:
This is {an example} of : some of the: text.
This is yet {another : example :} of some of the text.
:This: is :still :yet another {:example:} of :some text:.
I need to delete any text found inside any :'s, including the :'s, but not if they fall inside a pair of { and }.
Anything between a { and } is safe, including :'s.
Anything not between a { and } but found between : and : is deleted.
The :'s found outside { and } are all deleted.
The output would look like this:
This is {an example} of text.
This is yet {another : example :} of some of the text.
is yet another {:example:} of .
There is only one set of braces per line.
The paired braces are never split across lines.
There could be any number of :'s on the line, inside or outside the braces.
:'s always come in pairs.
How can I delete everything between colons, including the colons themselves, but not when protected by braces?
My best attempt so far is to use awk -F"{" '{ print $1 }' > file1.txt, awk -F"{" '{ print $2 }' > file2.txt, etc. to split the lines around the braces into different files, run sed on the specific files to remove the parts (but not on the files containing the data inside the braces), and then assemble everything back together with paste. This solution is far too complicated.
This will do as you ask
use strict;
use warnings;
my $data = do {
local $/;
<DATA>;
};
my @parts = split m/ ( \{ [^{}]* \} ) /x, $data;
for (@parts) {
    s/ : [^:]* : //gx unless /^\{/;
}
print @parts, "\n";
__DATA__
This is {an example} of : some of the: text.
This is yet {another : example :} of some of the text.
:This: is :still :yet another {:example:} of :some text:.
output
This is {an example} of text.
This is yet {another : example :} of some of the text.
is yet another {:example:} of .
This is simple; try the following:
perl -pe 's/({[^{}]*})|:[^:]*:/$1/g' file
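Any text inside { } is captured in $1 and put back unchanged, so it is skipped. A colon-delimited span outside braces matches the second alternative, where $1 is empty, so it is deleted.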
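The same alternation trick can be written as a small function. This is a sketch only: the name strip_colon_spans is made up, the opening brace is escaped, and the replacement checks defined $1 so that it also runs cleanly under warnings.

```perl
use strict;
use warnings;

# Hypothetical helper: remove :...: spans unless they sit inside {...}.
sub strip_colon_spans {
    my ($line) = @_;
    # First alternative: a brace group, captured and put back unchanged.
    # Second alternative: a colon-delimited span, replaced by nothing.
    $line =~ s/(\{[^{}]*\})|:[^:]*:/defined $1 ? $1 : ''/ge;
    return $line;
}

# Sample lines from the question.
my @samples = (
    'This is {an example} of : some of the: text.',
    'This is yet {another : example :} of some of the text.',
    ':This: is :still :yet another {:example:} of :some text:.',
);

print strip_colon_spans($_), "\n" for @samples;
```

Because the brace group comes first in the alternation, the regex engine consumes a whole {...} block before it can ever start a colon match inside it.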
In Perl:
#!/usr/bin/env perl
while (<>) {
    my @chars = split //;
    foreach my $c (@chars) {
        if ($c eq "{" .. $c eq "}") {
            print "$c";
        } elsif ($c eq ":" ... $c eq ":") {
        }
        else {
            print "$c";
        }
    }
}
or put more succinctly:
while (<>) {
print grep {/\{/ .. /\}/ or not /:/ ... /:/} split //;
}
Counting braces and colons:
perl -ne '
$b = $c = 0;
for $char (split //) {
$b++ if $char eq "{";
$b-- if $char eq "}";
if ($b > 0) {
print $char;
}
else {
if ($c == 0 and $char eq ":") {
$c++;
}
else {
print $char if $c == 0;
$c-- if $c == 1 and $char eq ":";
}
}
}
' <<END
This is {an example} of : some of the: text.
This is yet {another : example :} of some of the text.
:This: is :still :yet another {:example:} of :some text:.
END
This is {an example} of text.
This is yet {another : example :} of some of the text.
is yet another {:example:} of .

Script to migrate data from one source to another

I have a .h file, among other things, containing data in this format
struct X[]{
{"Field", "value1 value2 value"},
{"Field2", "value11 value12 value232"},
{"Field3", "x y z"},
{"Field4", "a bbb s"},
{"Field5", "sfsd sdfdsf sdfs"};
/****************/
};
I have a text file containing the values that I want to replace in the .h file, each followed by its new value:
value1 Valuesdfdsf1
value2 Value1dfsdf
value3 Value1_another
sfsd sfsd_ewew
sdfdsf sdfdsf_ew
sdfs sfsd_new
And the resulting .h file will contain the replacements from the text file above. Everything else remains the same.
struct X[]{
{"Field1", "value11 value12 value232"},
{"Field2", "value11 value12 value232"},
{"Field3", "x y z"},
{"Field4", "a bbb s"},
{"Field5", "sfsd_ewew sdfdsf_ew sdfs_new"};
/****************/
};
Please help me come up with a solution using Unix tools: awk, perl, bash, sed, etc.
cat junk/n2.txt | perl -e '{use File::Slurp; my @r = File::Slurp::read_file("junk/n.txt"); my %r = map {chomp; (split(/\s+/,$_))[0,1]} @r; while (<>) { unless (/^\s*{"/) {print $_; next;}; my ($pre,$values,$post) = ($_ =~ /^(\s*{"[^"]+", ")([^"]+)(".*)$/); my @new_values = map { exists $r{$_} ? $r{$_}:$_ } split(/\s+/,$values); print $pre . join(" ",@new_values) . $post . "\n"; }}'
Result:
struct X[]{
{"Field", "value1 Value1dfsdf value"},
{"Field2", "value11 value12 value232"},
{"Field3", "x y z"},
{"Field4", "a bbb s"},
{"Field5", "sfsd_ewew sdfdsf_ew sfsd_new"};
/****************/
};
Code untangled:
use File::Slurp;
my @replacements = File::Slurp::read_file("junk/n.txt");
my %r = map {chomp; (split(/\s+/,$_))[0,1]} @replacements;
while (<>) {
    unless (/^\s*{"/) {print $_; next;}
    my ($pre,$values,$post) = ($_ =~ /^(\s*{"[^"]+", ")([^"]+)(".*)$/);
    my @new_values = map { exists $r{$_} ? $r{$_} : $_ } split(/\s+/, $values);
    print $pre . join(" ",@new_values) . $post . "\n";
}
#!/usr/bin/perl
use strict; use warnings;
# you need to populate %lookup from the text file
my %lookup = qw(
value1 Valuesdfdsf1
value2 Value1dfsdf
value3 Value1_another
sfsd sfsd_ewew
sdfdsf sdfdsf_ew
sdfs sfsd_new
);
while ( my $line = <DATA> ) {
if ( $line =~ /^struct \w+\Q[]/ ) {
print $line;
process_struct(\*DATA, \%lookup);
}
else {
print $line;
}
}
sub process_struct {
my ($fh, $lookup) = @_;
while (my $line = <$fh> ) {
unless ( $line =~ /^{"(\w+)", "([^"]+)"}([,;])\s+/ ) {
print $line;
return;
}
my ($f, $v, $p) = ($1, $2, $3);
$v =~ s/(\w+)/exists $lookup->{$1} ? $lookup->{$1} : $1/eg;
printf qq|{"%s", "%s"}%s\n|, $f, $v, $p;
}
return;
}
__DATA__
struct X[]{
{"Field", "value1 value2 value"},
{"Field2", "value11 value12 value232"},
{"Field3", "x y z"},
{"Field4", "a bbb s"},
{"Field5", "sfsd sdfdsf sdfs"};
/****************/
};
Here's a simple looking program:
use strict;
use warnings;
use File::Copy;
use constant {
OLD_HEADER_FILE => "headerfile.h",
NEW_HEADER_FILE => "newheaderfile.h",
DATA_TEXT_FILE => "data.txt",
};
open (HEADER, "<", OLD_HEADER_FILE) or
die qq(Can't open file old header file ") . OLD_HEADER_FILE . qq(" for reading);
open (NEWHEADER, ">", NEW_HEADER_FILE) or
die qq(Can't open file new header file ") . NEW_HEADER_FILE . qq(" for writing);
open (DATA, "<", DATA_TEXT_FILE) or
die qq(Can't open file data file ") . DATA_TEXT_FILE . qq(" for reading);
#
# Put Replacement Data in a Hash
#
my %dataHash;
while (my $line = <DATA>) {
chomp($line);
my ($key, $value) = split (/\s+/, $line);
$dataHash{$key} = $value if ($key and $value);
}
close (DATA);
#
# NOW PARSE THROUGH HEADER
#
while (my $line = <HEADER>) {
chomp($line);
if ($line =~ /^\s*\{"Field/) {
foreach my $key (keys(%dataHash)) {
$line =~ s/\b$key\b/$dataHash{$key}/g;
}
}
print NEWHEADER "$line\n";
}
close (HEADER);
close (NEWHEADER);
copy(NEW_HEADER_FILE, OLD_HEADER_FILE) or
die qq(Unable to replace ") . OLD_HEADER_FILE . qq(" with ") . NEW_HEADER_FILE . qq(");
I could make it more efficient by using map, but that would make it harder to understand.
Basically:
I open three files: the original header, the new header I'm building, and the data file.
I first put my data into a hash where the replacement text is keyed by the original text. (I could have done it the other way around if I wanted.)
I then go through each line of the original header.
** If I see a line that looks like it's a field line, I know that I might have to do a replacement.
** For each entry in my %dataHash, I substitute the $key with the $dataHash{$key} replacement value. I use \b to mark word boundaries. This way, field11 is not substituted just because field1 appears in that string.
** Now I write the line back to my new header file. If I didn't replace anything, I just write back the original line.
Once I finish, I copy the new header over the old header file.
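The word-boundary step can be tried on its own. A minimal sketch, assuming a two-entry subset of the replacement table and a made-up header line; the \Q...\E quoting is an addition that is not in the program above and only matters if a key contains regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical subset of the replacement table from the question.
my %replacement = (
    value1 => 'Valuesdfdsf1',
    value2 => 'Value1dfsdf',
);

# Made-up header line: value11 must survive even though it starts with value1.
my $line = '{"Field", "value1 value2 value11"},';

for my $key (keys %replacement) {
    # \b on both sides means only whole words are replaced.
    $line =~ s/\b\Q$key\E\b/$replacement{$key}/g;
}

print "$line\n";
```

Without the \b anchors, the value1 substitution would also rewrite the start of value11.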
This script should work
keyval is the file containing key value pairs
filetoreplace is the file containing data to be modified
The file named changed will contain the changes
#!/bin/sh
echo
keylist=`cat keyval | awk '{ print $1}'`
while read line
do
for i in $keylist
do
if echo $line | grep -wq $i; then
value=`grep -w $i keyval | awk '{print $2}'`
line=`echo $line | sed -e "s/$i/$value/g"`
fi
done
echo $line >> changed
done < filetoreplace
This might be kind of slow if your files are big.
gawk -F '[ \t]*|"' 'FNR == NR {repl[$1]=$2;next}{for (f=1;f<=NF;++f) for (r in repl) if ($f == r) $f=repl[r]; print} ' keyfile file.h

Resources