I am trying to fill a hash in Perl from a file of around 564k lines. The code takes about 1.6 to 2.1 seconds to execute, while the equivalent in C# takes around 0.8 seconds. Is there a better way to do it in Perl?
This is what I have tried so far:
# 1 - this version takes ~1.6 seconds to fill the hash from a file with ~564000 lines
my %voc;
while(defined(my $line=<F>)) {
    $voc{$line} = 1;
}
and this:
# 2 - this version takes ~2.1 seconds to fill the hash from a file with ~564000 lines
my %voc;
my @voc_keys;
my @array_of_ones;
my $voc_keys_index = 0;
while(defined(my $line=<F>)) {
    $voc_keys[$voc_keys_index] = $line;
    $array_of_ones[$voc_keys_index] = 1;
    $voc_keys_index ++;
}
@voc{@voc_keys} = @array_of_ones;
In C#, I am using:
var voc = new Dictionary<String, int>();
foreach (string line in File.ReadLines(pathToVoc_file))
{
    var trimmedline = line.TrimEnd(new char[] { '\n' });
    voc[trimmedline] = 1;
}
And it takes only 700~800 ms.
Avoiding storing 1s as the values and testing with exists instead can definitely save time and memory. You can eke out even more by removing the block from the loop:
my %voc;
chomp, undef $voc{$_} while <F>;
Benchmark results (using 20 character lines):
Benchmark: running ikegami, original, statementmodifier, statementmodifier_undef for at least 10 CPU seconds...
ikegami: 10 wallclock secs ( 9.54 usr + 0.46 sys = 10.00 CPU) # 2.10/s (n=21)
original: 10 wallclock secs ( 9.62 usr + 0.45 sys = 10.07 CPU) # 2.09/s (n=21)
statementmodifier: 10 wallclock secs ( 9.61 usr + 0.48 sys = 10.09 CPU) # 2.18/s (n=22)
statementmodifier_undef: 11 wallclock secs ( 9.85 usr + 0.48 sys = 10.33 CPU) # 2.23/s (n=23)
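use strict;
use warnings;
use Benchmark 'timethese';
my $voc_file = 'rand.txt';
sub check {
    my ($voc) = @_;
    unless (keys %$voc == 564000) {
        warn "bad number of keys ", scalar keys %$voc;
    }
    chomp(my $expected_line = `head -1 $voc_file`);
    unless (exists $voc->{$expected_line}) {
        warn "bad data";
    }
}
timethese(-10, {
    'statementmodifier' => sub {
        my %voc;
        open(F, '<', $voc_file) or die "Cannot open $voc_file: $!";
        chomp, $voc{$_} = 1 while <F>;
        close(F);
        check(\%voc);
    },
    'statementmodifier_undef' => sub {
        my %voc;
        open(F, '<', $voc_file) or die "Cannot open $voc_file: $!";
        chomp, undef $voc{$_} while <F>;
        close(F);
        check(\%voc);
    },
    'original' => sub {
        my %voc;
        open(F, '<', $voc_file) or die "Cannot open $voc_file: $!";
        while(defined(my $line=<F>)) {
            chomp($line);
            $voc{$line} = 1;
        }
        close(F);
        check(\%voc);
    },
    'ikegami' => sub {
        my %voc;
        open(F, '<', $voc_file) or die "Cannot open $voc_file: $!";
        while(defined(my $line=<F>)) {
            chomp($line);
            undef $voc{$line};
        }
        close(F);
        check(\%voc);
    },
});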
(Original incorrect answer replaced with this.)
Of course C# is going to be faster.
You could save a little time and some memory by replacing
$voc{$line} = 1; ... if ($voc{$key}) { ... } ...
with
undef $voc{$line}; ... if (exists($voc{$key})) { ... } ...
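A minimal sketch of that replacement, assuming the words come from an in-memory filehandle rather than the real vocabulary file (the sample words are made up for illustration):

```perl
use strict;
use warnings;

# Illustrative data only: an in-memory file standing in for the vocabulary file.
my $data = "apple\nbanana\ncherry\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my %voc;
while (defined(my $line = <$fh>)) {
    chomp $line;
    undef $voc{$line};    # creates the key without storing a value
}
close $fh;

# Membership has to be tested with exists, because every value is undef.
my $has_banana = exists $voc{banana} ? 1 : 0;
my $has_durian = exists $voc{durian} ? 1 : 0;
print "banana: $has_banana, durian: $has_durian\n";
```

Note that a plain truth test such as if ($voc{banana}) would always fail with this layout, so every lookup in the rest of the program has to switch to exists as well.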
I have an array of strings of about 100,000 elements. I need to iterate through each element and replace some words with other words. This takes a few seconds in pure perl. I need to speed this up as much as I can. I'm testing using the following snippet:
use strict;
my $string = "This is some string. Its only purpose is for testing.";
for( my $i = 1; $i < 100000; $i++ ) {
    $string =~ s/old1/new1/ig;
    $string =~ s/old2/new2/ig;
    $string =~ s/old3/new3/ig;
    $string =~ s/old4/new4/ig;
    $string =~ s/old5/new5/ig;
}
I know this doesn't actually replace anything in the test string, but it's for speed testing only.
I had my hopes set on Inline::C. I've never worked with Inline::C before but after reading up on it a bit, I thought it was fairly simple to implement. But apparently, even calling a stub function that does nothing is a lot slower. Here's the snippet I tested with:
use strict;
use Benchmark qw ( timethese );
use Inline 'C';
timethese(
    5,
    {
        "Pure Perl" => \&pure_perl,
        "Inline C" => \&inline_c
    }
);
sub pure_perl {
    my $string = "This is some string. Its only purpose is for testing.";
    for( my $i = 1; $i < 1000000; $i++ ) {
        $string =~ s/old1/new1/ig;
        $string =~ s/old2/new2/ig;
        $string =~ s/old3/new3/ig;
        $string =~ s/old4/new4/ig;
        $string =~ s/old5/new5/ig;
    }
}
sub inline_c {
    my $string = "This is some string. Its only purpose is for testing.";
    for( my $i = 1; $i < 1000000; $i++ ) {
        $string = findreplace( $string, "old1", "new1" );
        $string = findreplace( $string, "old2", "new2" );
        $string = findreplace( $string, "old3", "new3" );
        $string = findreplace( $string, "old4", "new4" );
        $string = findreplace( $string, "old5", "new5" );
    }
}
__END__
__C__
char *
findreplace( char *text, char *what, char *with ) {
    return text;
}
On my Linux box, the result is:
Benchmark: timing 5 iterations of Inline C, Pure Perl...
Inline C: 6 wallclock secs ( 5.51 usr + 0.02 sys = 5.53 CPU) # 0.90/s (n=5)
Pure Perl: 2 wallclock secs ( 2.51 usr + 0.00 sys = 2.51 CPU) # 1.99/s (n=5)
Pure Perl is twice as fast as calling an empty C function. Not at all what I expected! Again, I've never worked with Inline::C before so maybe I am missing something here?
In the version using Inline::C, you kept everything that was in the original pure Perl script and changed just one thing: you replaced Perl's highly optimized s/// with a worse implementation. Invoking your dummy function actually involves work, whereas none of the s/// invocations do much in this case. It is a priori impossible for the Inline::C version to run faster.
On the C side, the function
char *
findreplace( char *text, char *what, char *with ) {
    return text;
}
is not a "do nothing" function. Calling it involves unpacking arguments. The string pointed to by text has to be copied to the return value. There is some overhead which you are paying for each invocation.
Given that s/// does no replacements, there is no copying involved in that. In addition, Perl's s/// is highly optimized. Are you sure you can write a better find & replace that is faster to make up for the overhead of calling an external function?
If you use the following implementation, you should get comparable speeds:
sub inline_c {
    my $string = "This is some string. Its only purpose is for testing.";
    for( my $i = 1; $i < 1000000; $i++ ) {
        findreplace( $string );
        findreplace( $string );
        findreplace( $string );
        findreplace( $string );
        findreplace( $string );
    }
}
__END__
__C__
void findreplace( char *text ) {
    return;
}
Benchmark: timing 5 iterations of Inline C, Pure Perl...
Inline C: 6 wallclock secs ( 5.69 usr + 0.00 sys = 5.69 CPU) # 0.88/s (n=5)
Pure Perl: 6 wallclock secs ( 5.70 usr + 0.00 sys = 5.70 CPU) # 0.88/s (n=5)
The one possibility of gaining speed is to exploit any special structure involved in the search pattern and replacements and write something to implement that.
On the Perl side, you should at least pre-compile the patterns.
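As a rough sketch of what pre-compiling could look like, the patterns can be built once with qr// and reused for every string. The rule table and sample strings below are hypothetical. This mostly matters when the patterns arrive in variables; literal patterns like the ones in the question are already compiled only once.

```perl
use strict;
use warnings;

# Hypothetical replacement table: each pattern is compiled once with qr//.
my @rules = (
    [ qr/old1/i, 'new1' ],
    [ qr/old2/i, 'new2' ],
    [ qr/old3/i, 'new3' ],
);

# Made-up input strings for illustration.
my @strings = ('This is OLD1 text.', 'old2 and old3 here.', 'Nothing to replace.');

# The loop variable aliases each array element, so the edits happen in place.
for my $string (@strings) {
    for my $rule (@rules) {
        my ($pattern, $replacement) = @$rule;
        $string =~ s/$pattern/$replacement/g;
    }
}

print "$_\n" for @strings;
```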
Also, since your problem is embarrassingly parallel, you are better off looking into chopping up the work into as many chunks as you have cores to work with.
For example, take a look at the Perl entries in the regex-redux task in the Benchmarks Game:
Perl #4 (fork only): 14.13 seconds
Perl #3 (fork & threads): 14.47 seconds
Perl #1: 34.01 seconds
That is, some primitive exploitation of parallelization possibilities cuts the run time by roughly 60%. That problem is not exactly comparable because the substitutions must be done sequentially, but it still gives you an idea.
If you have eight cores, dole out the work to eight cores.
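One way to dole out the work, sketched with plain fork and pipes. This assumes a platform with a real fork(), input lines without embedded newlines, and a single hard-coded substitution; parallel_replace is a hypothetical helper name, not something from a module.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: apply one substitution to @$lines using up to $workers
# child processes. Assumes a real fork() and lines that contain no newlines.
sub parallel_replace {
    my ($workers, $lines) = @_;
    return [] unless @$lines;
    my $size = int((@$lines + $workers - 1) / $workers);    # lines per child

    my @children;
    for (my $start = 0; $start < @$lines; $start += $size) {
        my $end = $start + $size - 1;
        $end = $#$lines if $end > $#$lines;

        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {    # child: process one slice and write it back
            close $reader;
            for my $line (@{$lines}[$start .. $end]) {
                (my $copy = $line) =~ s/old1/new1/ig;
                print {$writer} "$copy\n";
            }
            close $writer;      # flushes the buffered output
            POSIX::_exit(0);    # exit without running the parent's END blocks
        }

        close $writer;
        push @children, [$pid, $reader];
    }

    # Collect the slices in the order they were handed out.
    my @result;
    for my $child (@children) {
        my ($pid, $reader) = @$child;
        chomp(my @part = <$reader>);
        push @result, @part;
        close $reader;
        waitpid($pid, 0);
    }
    return \@result;
}

my $out = parallel_replace(2, ['old1 a', 'b OLD1', 'c']);
print join('|', @$out), "\n";
```

Sending every line through a pipe has its own cost, so this is only likely to pay off when the per-line work is substantial. Measure before committing to it.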
Also, consider the following script:
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Fake::Text;
use List::Util qw( sum );
use Time::HiRes qw( time );
use constant INPUT_SIZE => $ARGV[0] // 1_000_000;
run();
sub run {
    my @substitutions = (
        sub { s/dolor/new1/ig },
        sub { s/fuga/new2/ig },
        sub { s/facilis/new3/ig },
        sub { s/tempo/new4/ig },
        sub { s/magni/new5/ig },
    );
    my @times;
    for (1 .. 5) {
        my $data = read_input();
        my $t0 = time;
        find_and_replace($data, \@substitutions);
        push @times, time - $t0;
    }
    printf "%.4f\n", sum(@times)/@times;
    return;
}
sub find_and_replace {
    my $data = shift;
    my $substitutions = shift;
    for ( @$data ) {
        for my $s ( @$substitutions ) {
            $s->();    # each substitution operates on $_
        }
    }
    return;
}
{
    my @input;    # generated once, then copied for every run
    sub read_input {
        @input
            or @input = map fake_sentences(1)->(), 1 .. INPUT_SIZE;
        return [ @input ];
    }
}
In this case, each invocation of find_and_replace takes about 2.3 seconds on my laptop. The five replications run in about 30 seconds. The overhead is the combined cost of generating the 1,000,000 sentence data set and copying it four times.
I have been asked to write a Perl program that looks for a value (from user input) in an array. If it matches an element, that's fine. If it does not match, I need to check whether it falls between two neighbouring elements (index[0] and index[1], and so on up to index[n]) and then report which of those two elements it is nearest to.
Let me explain.
Given array: 10 15 20 25 30
Value from the user: 14 (for example)
Here 14 falls between two elements, 10 (array[0]) and 15 (array[1]).
The constraint is that I must not use more than one for loop and must never use a while loop; only a single for loop and as many if conditions as needed.
This is the code with which I got the output:
use strict;
use warnings;
my @arr1 = qw(10 15 20 25 30);
chomp(my $in = <STDIN>);
if(grep { $_ == $in } @arr1)
{ } #print "S: $in\n"; }
else
{
    for(my $i=0; $i<$#arr1; $i++)
    {
        my $j = $i + 1;
        if($in > $arr1[$i] && $in < $arr1[$j])
        {
            #print "SN: $arr1[$i]\t$arr1[$j]\n";
            my ($inc, $dec) = (0, 0);
            # count the steps up from the lower neighbour
            my $chk1 = $arr1[$i] + 1;
            AGAIN1:
            if($in == $chk1)
            { }
            else
            { $chk1++; $inc++; goto AGAIN1; }
            # count the steps down from the upper neighbour
            my $chk2 = $arr1[$j] - 1;
            AGAIN2:
            if($in == $chk2){ }
            else
            { $chk2--; $dec++; goto AGAIN2; }
            if($inc > $dec)
            { print "Matched value nearest to $arr1[$j]\n"; }
            elsif($inc < $dec)
            { print "Matched value nearest to $arr1[$i]\n"; }
        }
    }
}
However, my question is whether there is a better algorithm for this. Any help would be appreciated.
Thanks in advance.
You seem determined to make this as complicated as possible :-)
Your specification isn't completely clear, but I think this does what you want:
use strict;
use warnings;
use 5.010;
my @array = qw[10 15 20 25 30];
chomp(my $in = <STDIN>);
if ($in < $array[0]) {
    say "$in is less than first element in the array";
    exit;
}
if ($in > $array[-1]) {
    say "$in is greater than last element in the array";
    exit;
}
for (0 .. $#array) {
    if ($in == $array[$_]) {
        say "$in is in the array";
        exit;
    }
    if ($in < $array[$_]) {
        if ($in - $array[$_ - 1] < $array[$_] - $in) {
            say "$in is closest to $array[$_ - 1]";
            exit;
        } else {
            say "$in is closest to $array[$_]";
            exit;
        }
    }
}
say "Shouldn't get here!";
Using the helper functions any and reduce from the core module List::Util and the built-in abs:
use strict;
use warnings;
use List::Util qw/reduce any/;
my @arr1 = qw(10 15 20 25 30);
chomp(my $in = <STDIN>);
if (any {$in == $_} @arr1) {
    print "$in is in the array\n";
}
else {
    my $i = reduce { abs($in - $arr1[$a]) > abs($in - $arr1[$b]) ? $b : $a} 0 .. $#arr1;
    print "$in is closest to $arr1[$i]\n";
}
Based on many books and Internet sources, I was wondering whether $_ is really a faster way of iterating through an array (no new variable is instantiated), but somehow I always get different results. Here's the performance test code:
use Time::HiRes qw(time);
use strict;
use warnings;
# $_ is a default argument for many operators, and also for some control structures.
my $test_array = [1..1000000];
my $number_of_tests = 100;
my $dollar_wins = 0;
my $dollar_wins_sum = 0;
for (my $i = 1; $i <= $number_of_tests; $i++) {
    my $odd_void_array = [];
    my $start_time_1 = time();
    foreach my $item (@{$test_array}) {
        if ($item % 2 == 1) {
            push (@{$odd_void_array}, $item);
        }
    }
    foreach my $item_odd (@{$odd_void_array}) {
    }
    my $end_time_1 = time();
    $odd_void_array = [];
    my $start_time_2 = time();
    foreach (@{$test_array}) {
        if ($_ % 2 == 1) {
            push (@{$odd_void_array}, $_);
        }
    }
    foreach (@{$odd_void_array}) {
    }
    my $end_time_2 = time();
    my $diff = ($end_time_1-$start_time_1) - ($end_time_2-$start_time_2);
    if ($diff > 0) {
        $dollar_wins ++;
        $dollar_wins_sum += $diff;
        print "Dollar won ($dollar_wins out of $i) with diff $diff \n";
    }
}
print "=================================\n";
print "When using dollar underscore, execution was faster in $dollar_wins cases (".(($dollar_wins/$number_of_tests)*100)."%), with average difference of ".($dollar_wins_sum/$dollar_wins)."\n";
So I iterate twice: once assigning to my $item, and once without. Mostly I find that iterating with $_ was faster in only about 20-30% of cases.
Shouldn't iterating without a new variable be faster?
You aren't really benchmarking iteration with different variables.
Your timings include array creation and other calculations.
You only tell which is faster, not by how much.
You have too few iterations to tell anything reliable.
Let's take this better test that actually benchmarks what you are claiming to benchmark:
use strict;
use warnings;
use Benchmark ':hireswallclock', 'cmpthese';
my @numbers = 1..100_000;
cmpthese -3, {
    '$_' => sub {
        for (@numbers) {
            # empty body: only the iteration is measured
        }
    },
    'my $x' => sub {
        for my $x (@numbers) {
        }
    },
    '$x' => sub {
        my $x;
        for $x (@numbers) {
        }
    },
};
        Rate    $_ my $x    $x
$_     107/s    --   -0%   -0%
my $x  107/s    0%    --   -0%
$x     108/s    0%    0%    --
So they are equally fast on my test system (perl 5.18.2 built for i686-linux-thread-multi-64int).
My suspicion is that using $_ is slightly slower than a lexical, as it's a global variable. However, the speed of iteration is equivalent. Indeed, modifying the benchmark…
use strict;
use warnings;
use Benchmark ':hireswallclock', 'cmpthese';
my @numbers = 1..100_000;
cmpthese -3, {
    '$_' => sub {
        for (@numbers) {
            $_ % 2 == 0;
        }
    },
    'my $x' => sub {
        for my $x (@numbers) {
            $x % 2 == 0;
        }
    },
    '$x' => sub {
        my $x;
        for $x (@numbers) {
            $x % 2 == 0;
        }
    },
};
… gives
        Rate    $_    $x my $x
$_    40.3/s    --   -1%   -6%
$x    40.6/s    1%    --   -5%
my $x 42.9/s    7%    6%    --
but the effects are still too small to draw any solid conclusion.
The code below runs on a 5 GB file and consumes 99% CPU. I would like to know whether I am doing something terribly wrong, or whether anything can improve the execution time.
2013-04-03 08:54:19,989 INFO [Logger] 2013-04-03T08:54:19.987-04:00PCMC.common.manage.springUtil<log-message-body><headers><fedDKPLoggingContext id="DKP_DumpDocumentProperties" type="context.generated.FedDKPLoggingContext"><logFilter>7</logFilter><logSeverity>255</logSeverity><schemaType>PCMC.MRP.DocumentMetaData</schemaType><UID>073104c-4e-4ce-bda-694344ee62</UID><consumerSystemId>JTR</consumerSystemId><consumerLogin>jbserviceid</consumerLogin><logLocation>Successful Completion of Service</logLocation></fedDKPLoggingContext></headers><payload>0</payload></log-message-body>
This is the code I am using. I also tried feeding it the gz format, but all in vain. I call this awk script from bash with the command below:
awk -f mytest.awk <(gzip -dc scanned-file.$yesterday.gz)| gzip > tem.gz
cat mytest.awk
#!/bin/awk -f
function to_ms (time, time_ms, s) {
    split(time, s, /:|\,/ )
    time_ms = (s[1]*3600+s[2]*60+s[3])*1000+s[4]
    #printf ("%s\n", newtime)
    return time_ms
}
{
    stid = gensub(/.*UID>([^&]+).*/,"\\1","")
}
(stid in starttime) {
    etime = to_ms($2)
    endtime[stid] = etime
    docid[stid] = gensub(/.*id="([^""]+).*/,"\\1","")
    consumer[stid]= gensub(/.*schemaType>PNC.([^.]+).*/,"\\1","")
    state[stid]= gensub(/.*lt;logLocation>([^'' ]+).*/,"\\1","")
    next
}
{
    stime = to_ms($2)
    starttime[stid] = stime
    st_hour[stid] = stime/(60*60*1000)
    timestamp[stid] = $1" "$2
}
END {
    print "Document,Consumer,Hour,ResponseTime,Timestamp,State"
    for (x in starttime) {
        for (y in endtime) {
            if (x==y) {
                diff = (endtime[y]-starttime[x])
                st = sprintf("%02d", st_hour[x])
                print docid[y], consumer[y], st":00", diff, timestamp[x], state[y] |"sort -k3"
                delete starttime[x]
                delete endtime[y]
                delete docid[y]
                delete consumer[y]
                delete timestamp[x]
                delete state[y]
            }
        }
    }
}
Assuming there's just one end time for every stid, don't build up an array of start times and an array of end times and then loop through them both; just process the stid when you hit its end time, i.e. not this, as you do today:
{ stid = whatever }
stid in starttime {
    populate endtime[stid]
}
{ populate starttime[stid] }
END {
    for (x in starttime) {
        for (y in endtime) {
            if (x == y) {
                stid = x
                process endtime[stid] - starttime[stid]
            }
        }
    }
}
but this:
{ stid = whatever }
stid in starttime {
    process to_ms($2) - starttime[stid]
    delete starttime[stid]
    next
}
{ populate starttime[stid] }
If you can't do that, e.g. due to there being multiple records with the same stid and you want to get the time stamps from the first and last ones, then change the loop in your END section to just loop through the stids that you got endtimes for (since you already KNOW they have corresponding starttimes) instead of trying to find all the stids in those massive loops on starttime and endtime, e.g.:
{ stid = whatever }
stid in starttime {
    populate endtime[stid]
}
{ populate starttime[stid] }
END {
    for (stid in endtime) {
        process endtime[stid] - starttime[stid]
    }
}
With either approach you should see a massive performance improvement.
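To show the shape of the first approach, here is the same single-pass idea sketched in Perl rather than awk. The record layout ("id time_ms" per line) is a simplification invented for this example, not the log format from the question.

```perl
use strict;
use warnings;

# Simplified, made-up records: "<id> <time in ms>". The first record seen for
# an id is treated as its start, the second as its end.
my @records = (
    'a 1000',
    'b 1500',
    'a 1750',
    'b 4000',
);

my (%start, @report);
for my $record (@records) {
    my ($id, $ms) = split ' ', $record;
    if (exists $start{$id}) {
        # End record: report the elapsed time and forget the id right away.
        push @report, "$id " . ($ms - $start{$id});
        delete $start{$id};
        next;
    }
    $start{$id} = $ms;
}

print "$_\n" for @report;
```

Because each id is removed as soon as its end record arrives, the hash only ever holds the requests that are still open, and no second pass over the data is needed.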
In the END section, the script always runs through the whole inner for loop, even after the y item has been found and deleted from the endtime array. I would suggest using a break to jump out of the inner loop.
On the other hand, as far as I can see, the inner loop is not needed at all! It searches for an element with a known key in an associative array.
I would also suggest not deleting the found items. Finding an item in an associative array takes constant time (depending on the hash function and on how many collisions it produces), so removing items from such an array will not necessarily speed up the process, while the deletions themselves will definitely slow it down.
So I suggest using this:
for (x in starttime) {
    if (x in endtime) {
        diff = (endtime[x]-starttime[x])
        st = sprintf("%02d", st_hour[x])
        print docid[x], consumer[x], st":00", diff, timestamp[x], state[x] |"sort -k3"
    }
}
Using gzip will consume even more CPU resources, but you can save some I/O bandwidth.
@Ed, the first approach is not giving me the expected result. This is what I did:
# end time and diff
(stid in starttime)
{ etime = to_ms($2)
diff = etime - stime
print diff,stid
delete starttime[stid]
next }
# Populate starttime
{
stime = to_ms($2)
starttime[stid] = stime
st_hour[stid] = stime/(60*60*1000)
}
The result looks like this, with the difference in milliseconds on the left followed by the stid:
561849 c858591f-e01b-4407-b9f9-48302b65c383
562740 c858591f-e01b-4407-b9f9-48302b65c383
563629 56c71ef3-d952-4261-9711-16b18a32c6ba
564484 56c71ef3-d952-4261-9711-16b18a32c6ba
I sometimes access a hash like this:
if(exists $ids{$name}){
    $id = $ids{$name};
}
Is that good practice? I'm a bit concerned that it contains two lookups where really one should be done. Is there a better way to check the existence and assign the value?
By checking with exists, you prevent autovivification. See Autovivification : What is it and why do I care?.
UPDATE: As trendels points out below, autovivification does not come into play in the example you posted. I am assuming that the actual code involves multi-level hashes.
Here is an illustration:
use strict;
use warnings;
use Data::Dumper;
my (%hash, $x);
if ( exists $hash{test}->{vivify} ) {
    $x = $hash{test}->{vivify}->{now};
}
print Dumper \%hash;
$x = $hash{test}->{vivify}->{now};
print Dumper \%hash;
__END__
C:\Temp> t
$VAR1 = {
          'test' => {}
        };
$VAR1 = {
          'test' => {
                      'vivify' => {}
                    }
        };
You could apply Hash::Util's lock_keys to the hash. Then perform your assignments within an eval.
use Hash::Util qw/lock_keys/;
my %a = (
    1 => 'one',
    2 => 'two'
);
lock_keys(%a);
my $val;
eval {$val = $a{2}}; # this assignment completes
eval {$val = $a{3}}; # this assignment aborts
print "val=$val\n"; # has value 'two'
You can do it with one lookup like this:
$tmp = $ids{$name};
$id = $tmp if (defined $tmp);
However, I wouldn't bother unless I saw that that was a bottleneck.
Performance is not important in this case (see "Devel::NYTProf" to find where the time actually goes).
But to answer your question:
If the value does not exist in the hash, "exists" is very fast:
if(exists $ids{$name}){
    $id = $ids{$name};
}
but if it does exist, a second lookup is done.
If the value is likely to exist, then doing only one lookup will be faster:
$id = $ids{$name};
See this little benchmark from a Perl mailing list.
#!/usr/bin/perl -w
use strict;
use Benchmark qw( timethese );
use vars qw( %hash );
@hash{ 'A' .. 'Z', 'a' .. 'z' } = (1) x 52;
my $key = 'xx';
timethese 10000000, {
    'defined' => sub {
        if (defined $hash{$key}) { my $x = $hash{$key}; return $x; };
        return 0;
    },
    'defined_smart' => sub {
        my $x = $hash{$key};
        if (defined $x) {
            return $x;
        }
        return 0;
    },
    'exists' => sub {
        if (exists $hash{$key}) { my $x = $hash{$key}; return $x; };
        return 0;
    },
    'as is' => sub {
        if ($hash{$key}) { my $x = $hash{$key}; return $x; };
        return 0;
    },
    'as is_smart' => sub {
        my $x = $hash{$key};
        if ($x) { return $x; };
        return 0;
    },
};
Using a key ('xx') that does not exist shows that 'exists' is the winner.
Benchmark: timing 10000000 iterations of as is, as is_smart, defined, defined_smart, exists...
as is: 1 wallclock secs ( 1.52 usr + 0.00 sys = 1.52 CPU) # 6578947.37/s (n=10000000)
as is_smart: 3 wallclock secs ( 2.67 usr + 0.00 sys = 2.67 CPU) # 3745318.35/s (n=10000000)
defined: 3 wallclock secs ( 1.53 usr + 0.00 sys = 1.53 CPU) # 6535947.71/s (n=10000000)
defined_smart: 3 wallclock secs ( 2.17 usr + 0.00 sys = 2.17 CPU) # 4608294.93/s (n=10000000)
exists: 1 wallclock secs ( 1.33 usr + 0.00 sys = 1.33 CPU) # 7518796.99/s (n=10000000)
Using a key ('x') that does exist shows that 'as is_smart' is the winner.
Benchmark: timing 10000000 iterations of as is, as is_smart, defined, defined_smart, exists...
as is: 3 wallclock secs ( 2.76 usr + 0.00 sys = 2.76 CPU) # 3623188.41/s (n=10000000)
as is_smart: 3 wallclock secs ( 1.81 usr + 0.00 sys = 1.81 CPU) # 5524861.88/s (n=10000000)
defined: 3 wallclock secs ( 3.42 usr + 0.00 sys = 3.42 CPU) # 2923976.61/s (n=10000000)
defined_smart: 2 wallclock secs ( 2.32 usr + 0.00 sys = 2.32 CPU) # 4310344.83/s (n=10000000)
exists: 3 wallclock secs ( 2.83 usr + 0.00 sys = 2.83 CPU) # 3533568.90/s (n=10000000)
If it is not a multi-level hash you can do this:
$id = $ids{$name} || 'foo';
or if $id already has a value:
$id ||= $ids{$name};
where 'foo' is a default or fall-through value. If it is a multi-level hash you would use 'exists' to avoid the autovivification discussed earlier in the thread or not use it if autovivification is not going to be a problem.
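One caveat with ||: it falls through on any false value, so a stored 0 or empty string is replaced by the default. Since Perl 5.10 the defined-or operator // only falls through on undef. A small sketch, with made-up keys and values:

```perl
use strict;
use warnings;

my %ids = (alice => 42, bob => 0);    # illustrative data only

my $with_or         = $ids{bob}   || 'foo';    # 0 is false, so the default wins
my $with_defined_or = $ids{bob}   // 'foo';    # 0 is defined, so it is kept
my $missing         = $ids{carol} // 'foo';    # no such key: the lookup yields undef

print "$with_or $with_defined_or $missing\n";

# A plain read on a single-level hash does not create the key.
my $created = exists $ids{carol} ? 1 : 0;
print "carol created: $created\n";
```

Neither operator distinguishes a missing key from a key whose value is undef; if that difference matters, exists is still the tool to use.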
When I want high performance, I usually write this idiom to build a hash used as a set:
my %h;
for my $key (@some_vals) {
    $h{$key} = undef unless exists $h{$key};
}
return keys %h;
This code is a little bit faster than the commonly used $h{$key}++. exists avoids a useless assignment and undef avoids allocating a value. The best answer for you is: benchmark it! I guess that exists $ids{$name} is a little bit faster than $id=$ids{$name}, and if you have a big miss ratio, your version with exists can be faster than assigning and testing afterwards.
For example, if I wanted a fast set intersection, I would write something like this:
sub intersect {
    my $h;
    @$h{@{shift()}} = ();
    my $i;
    for (@_) {
        return unless %$h;
        $i = {};
        @$i{grep exists $h->{$_}, @$_} = ();
        $h = $i;
    }
    return keys %$h;
}
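A short usage sketch, with the function repeated so that the example runs on its own. The three input lists are made up, and the result is sorted only because keys returns the elements in no particular order.

```perl
use strict;
use warnings;

# Same function as above, repeated so the example is self-contained.
sub intersect {
    my $h;
    @$h{ @{ shift() } } = ();    # seed the set from the first list
    my $i;
    for (@_) {
        return unless %$h;       # nothing left to intersect
        $i = {};
        @$i{ grep exists $h->{$_}, @$_ } = ();
        $h = $i;
    }
    return keys %$h;
}

# Each argument is an array reference.
my @common = sort { $a <=> $b } intersect([1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 6]);
print "@common\n";
```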