Parse list of integers (optimization needed for speed test)

I am performing a tiny speed test in order to compare the speed of the Agda programming language with the Tcl scripting language. It's for scientific work and this is just a pre-test, not a real test. I am not in any way trying to perform a realistic speed comparison!
I have come up with a small example, in which Agda is 10x faster than Tcl. There are special reasons I use this example. My main concern is that my Tcl code is badly programmed and that this, rather than the language itself, is the sole reason Tcl is slower than Agda in this example.
The goal of the code is to parse a line that represents a list of integers and check if it is indeed a list of integers.
Example "(1,2,3)" would be a valid list.
Example "(1,a,3)" would not be a valid list.
My input is a file and I check every third line of the file. If any line is not a list of integers, the program prints "error".
My input file:
(613424,505980,317647,870930,75580,897160,716297,668539,689646,196362,533020)
(727375,472272,22435,869407,320468,80779,302881,240382,196077,635360,568517)
(613424,505980,317647,870930,75580,897160,716297,668539,689646,196362,533020)
(however, my real test file is about 3 megabytes in size)
My current Tcl code to solve this problem is:
package require Tcl 8.6
proc checkListNat {str} {
    set list [split [string map {"(" "" ")" ""} $str] ","]
    foreach l $list {
        if {[string is integer $l] == 0} {
            return 0
        }
    }
    return 1
}
set i 1
set fp [open "/tmp/test.txt" r]
while { [gets $fp data] >= 0 } {
    incr i
    if { [expr $i % 3] == 0} {
        if { [checkListNat $data] == 0 } {
            puts "error"
        }
    }
}
close $fp
How can I optimize my current Tcl code, so that the speed test between Agda and Tcl is more realistic?

The first thing to do is to put as much code in procedures (or lambda terms) as possible and to ensure that all expressions are braced. Those were your two key problems that were killing performance. We'll do a few other things too: you hardly ever need expr inside an if test, and this wasn't one of those cases; string trim is more suitable than string map here; and string is really ought to be used with -strict. With those changes, I get this version, which is fairly similar to what you already had yet ought to be substantially more performant.
package require Tcl 8.6
proc checkListNat {str} {
    foreach l [split [string trim $str "()"] ","] {
        if {[string is integer -strict $l] == 0} {
            return 0
        }
    }
    return 1
}
apply {{} {
    set i 1
    set fp [open "/tmp/test.txt" r]
    while { [gets $fp data] >= 0 } {
        if {[incr i] % 3 == 0 && ![checkListNat $data]} {
            puts "error"
        }
    }
    close $fp
}} {*}$argv
You might get better performance by adding fconfigure $fp -encoding iso8859-1; you'll have to test that yourself. But the key changes are the two called out at the start, as each substantially affects how efficiently the code can be compiled. (Also, Tcl 8.5 is a little faster than 8.6, whose radically different execution engine is a bit slower for some things, so you might test the new code with 8.5 too; the code itself appears to be valid with both versions.)
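A minimal sketch of where that option would go (untested; whether it actually helps depends on your platform and data):

set fp [open "/tmp/test.txt" r]
# Assumption: the input is plain ASCII/Latin-1, so Tcl can skip UTF-8 decoding work
fconfigure $fp -encoding iso8859-1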

Try checking with regexp {^[0-9,]+$} $line instead of the checkListNat function.
Update
Here is an example:
echo "87,566, 45,67\n56,5r5,45" >! try
...
while {[gets $fp line] >= 0} {
    if {[regexp {^[0-9, ]+$} $line]} {
        puts "OK $line"
    } else {
        puts "BAD $line"
    }
}
gives:
>OK 87,566, 45,67
>BAD 56,5r5,45
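Note that a character-class pattern like {^[0-9,]+$} is looser than the original checkListNat: it ignores the parentheses and accepts stray commas or spaces. If you want the regexp to check the full list shape from the question, something along these lines should work (a sketch, untested on the 3 MB input):

if {[regexp {^\(\d+(,\d+)*\)$} $line]} {
    puts "OK $line"
} else {
    puts "BAD $line"
}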

Related

TCL script to find a script and add new text before searched string with number

Below is the partial content of my input file:
xyz
abc
MainContent(abc_dt) {
    sc: it;
}
MainContent(xyz_cnt) {
    sc : it;
}
MainContent(asd_zxc) {
    sc : it;
}
Here, I want to find each "MainContent" line and add a new line before it. The new line should be "Sb(" followed by the text inside the MainContent parentheses and a numbered "_sbN" suffix, together with an opening bracket; a matching closing bracket should also be added after the MainContent block, before the next Sb occurrence:
Expected output from the script:
xyz
abc
Sb(abc_dt_sb1) {
    MainContent(abc_dt) {
        sc: it;
    }
}
Sb(xyz_cnt_sb2) {
    MainContent(xyz_cnt) {
        sc : it;
    }
}
Sb(asd_zxc_sb3) {
    MainContent(asd_zxc) {
        sc : it;
    }
}
Can someone please help me create a TCL script for this?
This code will take your text to be processed on standard input and produce the results on standard output. Redirect it as required.
set counter 0
set expectBrace false
while {[gets stdin line] >= 0} {
    if {!$expectBrace && [regexp {^\s*MainContent\s*\((\w+)\)} $line -> bits]} {
        puts [format "Sb(%s_sb%d) \{" $bits [incr counter]]
        set expectBrace true
    }
    puts $line
    if {$expectBrace && [regexp {^\s*\}\s*$} $line]} {
        puts "\}"
        set expectBrace false
    }
}
Using regular expressions to do the matching of the triggers for state changes in a little state machine (two states, governed by expectBrace) is pretty conventional parsing. I've used format to do the substitutions into the Sb(…) line; these are simple enough that you could use direct substitutions instead if you prefer.
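For reference, the direct-substitution variant of that format line would be a one-liner like this (equivalent sketch):

puts "Sb(${bits}_sb[incr counter]) \{"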
I've not done anything about adding indentation.

Executing a TCL script from line N

I have a TCL script that, say, has 30 lines of automation code, which I am executing in the dc shell (Synopsys Design Compiler). I want to stop and exit the script at line 10, exit the dc shell, and bring it back up again after performing a manual review. However, this time I want to run the script starting from line number 11, without having to execute the first 10 lines.
Instead of having two scripts, one containing the code up to line number 10 and the other containing the rest, I would like to use only one script and execute it from, let's say, line number N.
Something like:
source a.tcl -line 11
How can I do this?
If you have Tcl 8.6+ and if you consider re-modelling your script on top of a Tcl coroutine, you can realise this continuation behaviour in a few lines. This assumes that you run the script from an interactive Tcl shell (dc shell?).
# script.tcl
if {[info procs allSteps] eq ""} {
    # We are not re-entering (continuing), so start all over.
    proc allSteps {args} {
        yield;  # do not run when defining the coroutine
        puts 1
        puts 2
        puts 3
        yield;  # step out once the first sequence of steps (1-10) has been executed
        puts 4
        puts 5
        puts 6
        rename allSteps "";  # self-clean once the remaining steps (11-N) have run
    }
    coroutine nextSteps allSteps
}
nextSteps;  # run the coroutine
Pack your script into a proc body (allSteps).
Within the proc body, place a yield to mark the hold/continuation point after your first steps (e.g., after the 10th step).
Create a coroutine nextSteps based on allSteps.
Protect the proc and coroutine definitions so that they are not re-defined while steps are still pending.
Then, start your interactive shell and run source script.tcl:
% source script.tcl
1
2
3
Now, perform your manual review. Then, continue from within the same shell:
% source script.tcl
4
5
6
Note that you can run the overall two-phase sequence any number of times (because of the coroutine proc's self-cleanup via rename):
% source script.tcl
1
2
3
% source script.tcl
4
5
6
Again: all of this assumes that you do not exit from the shell, and that you maintain it while performing your review. If you need to exit the shell for whatever reason (or you cannot run Tcl 8.6+), then Donal's suggestion is the way to go.
Update
If applicable in your case, you may improve the implementation by using an anonymous (lambda) proc. This simplifies the lifecycle management (no re-definition to guard against, no separate coroutine and proc to manage, no need for a rename):
# script.tcl
if {[info commands nextSteps] eq ""} {
    # We are not re-entering (continuing), so start all over.
    coroutine nextSteps apply {args {
        yield;  # do not run when defining the coroutine
        puts 1
        puts 2
        puts 3
        yield;  # step out once the first sequence of steps (1-10) has been executed
        puts 4
        puts 5
        puts 6
    }}
}
nextSteps
nextSteps
The simplest way is to open the text file, parse it to get the first N commands (info complete is useful there), and then evaluate those (or the rest of the script). Doing this efficiently produces slightly different code when you're dropping the tail as opposed to when you're dropping the prefix.
proc ReadAllLines {filename} {
    set f [open $filename]
    set lines {}
    # A little bit careful in case you're working with very large scripts
    while {[gets $f line] >= 0} {
        lappend lines $line
    }
    close $f
    return $lines
}

proc SourceFirstN {filename n} {
    set lines [ReadAllLines $filename]
    set i 0
    set script {}
    foreach line $lines {
        append script $line "\n"
        if {[info complete $script] && [incr i] >= $n} {
            break
        }
    }
    info script $filename
    unset lines
    uplevel 1 $script
}

proc SourceTailN {filename n} {
    set lines [ReadAllLines $filename]
    set i 0
    set script {}
    for {set j 0} {$j < [llength $lines]} {incr j} {
        set line [lindex $lines $j]
        append script $line "\n"
        if {[info complete $script]} {
            if {[incr i] >= $n} {
                info script $filename
                set realScript [join [lrange $lines [incr j] end] "\n"]
                unset lines script
                return [uplevel 1 $realScript]
            }
            # Dump the prefix we don't need any more
            set script {}
        }
    }
    # If we get here, the script had fewer than n lines so there's nothing to do
}
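Usage would then look something like this (a sketch, assuming the 10-line split from the question and the a.tcl name used there):

# First session: run the first 10 complete commands, then stop for review
SourceFirstN a.tcl 10
# Later session: run everything after the first 10 commands
SourceTailN a.tcl 10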
Be aware that the kinds of files you're dealing with can get pretty large, and Tcl currently has some hard memory limits. On the other hand, if you can source the file at all, you're already within that limit…

How can I make DOT correctly process UTF-8 to PostScript and have multiple graph/pages?

This dot source
graph A
{
    a;
}

graph B
{
    "Enûma Eliš";
}
when compiled with dot -Tps generates this error
Warning: UTF-8 input uses non-Latin1 characters which cannot be handled by this PostScript driver
I can fix the UTF-8 problem by passing -Tps:cairo, but then only graph A is in the output; it is truncated to a single page. The same happens with -Tpdf. There are no other PostScript drivers available in my installation.
I could split the graphs into separate files and concatenate them afterwards, but I'd rather not. Is there a way to have correct UTF-8 handling and multiple page output?
Generating PDF or SVG could bypass the encoding problem too.
dot -Tpdf chs.dot > chs.pdf
# or
dot -Tsvg chs.dot > chs.svg
Apparently the dot PS driver can't handle encodings other than the old ISO8859-1. I think it can't change fonts either.
One thing you can do is run a filter over dot's PostScript output. The following Perl program does that; it's an adaptation of some code I had. It changes the encoding from UTF-8 to a modified ISO encoding with extra characters replacing unused ones.
Of course, the output still depends on the font having the characters. Since dot (I think) only uses the default PostScript fonts, anything beyond the "standard latin" is out of the question...
It works with Ghostscript or with any interpreter which defines AdobeGlyphList.
The filter should be used this way:
dot -Tps graph.dot | perl reenc.pl > output.ps
Here it is:
#!/usr/bin/perl
use strict;
use warnings;
use open qw(:std :utf8);

my $ps = do { local $/; <STDIN> };
my %high;
my %in_use;
foreach my $char (split //, $ps) {
    my $code = (unpack("C", $char))[0];
    if ($code > 127) {
        $high{$char} = $code;
        if ($code < 256) {
            $in_use{$code} = 1;
        }
    }
}

my %repl;
my $i = 128;
foreach my $char (keys %high) {
    if ($in_use{$high{$char}}) {
        $ps =~ s/$char/sprintf("\\%03o", $high{$char})/ge;
        next;
    }
    while ($in_use{$i}) { $i++; }
    $repl{$i} = $high{$char};
    $ps =~ s/$char/sprintf("\\%03o", $i)/ge;
    $i++;
}

my $psprocs = <<"EOPS";
/EncReplacements <<
@{[ join(" ", %repl) ]}
>> def
/RevList AdobeGlyphList length dict dup begin
AdobeGlyphList { exch def } forall
end def
% code -- (uniXXXX)
/uniX { 16 6 string cvrs dup length 7 exch sub exch
(uni0000) 7 string copy dup 4 2 roll putinterval } def
% font code -- glyphname
/unitoname { dup RevList exch known
{ RevList exch get }
{ uniX cvn } ifelse
exch /CharStrings get 1 index known not
{ pop /.notdef } if
} def
/chg-enc { dup length array copy EncReplacements
{ currentdict exch unitoname 2 index 3 1 roll put } forall
} def
EOPS

$ps =~ s{/Encoding EncodingVector def}{/Encoding EncodingVector chg-enc def};
$ps =~ s/(%%BeginProlog)/$1\n$psprocs/;
print $ps;

Perl, cmd, $ARGV[0], slow

[Strawberry Perl v5.16.3, Windows 7 x64, executing via cmd, e.g. c:\strawberry> perl test.pl 100000]
SYMPTOM: The following code, foreach (1..$ARGV[0]) { foo($_); }, executes roughly 20% slower than if I had included this extra line before it: my $num = $ARGV[0];
QUESTION: Can anyone help me understand why?
Notice, in the second case, that after I initialize and set $num, I do not then use $num in the loop parameters. Were that the case, I could probably be convinced that repeatedly testing against $ARGV[0] in a for loop is somehow slower than testing against a variable that I define myself... but this is not the case.
To track time, I use: use Time::HiRes; my $time = [Time::HiRes::gettimeofday()]; at the top of my script, and: print "\n1: ", Time::HiRes::tv_interval($time); at the bottom.
Confused!
Thanks,
Michael
EDIT
I am including the entire script, with a comment preceding the offending line... Interestingly, it looks like the time discrepancy is at least partially dependent on my redundant initialization of %h, as well as @chain... This is getting weird.
use Time::HiRes; my $time = [Time::HiRes::gettimeofday()];
#my $max=$ARGV[0];
my %h = (1=>1,89=>89);
$h{1}=1;
$h{89}=89;
my @chain=();
my $ans=0;
sub sum{my $o=0; foreach (@_){$o+=$_}; return $o;}
foreach (1..$ARGV[0]-1){
    my $x=$_;
    my @chain = ();
    while(!exists($h{$x})){
        push(@chain,$x);
        $x = sum(map {$_**2} split('',$x));
    }
    foreach (@chain){$h{$_}=$h{$x} if !exists($h{$_});}
}
print "\n1: ", Time::HiRes::tv_interval($time);
foreach (1..$ARGV[0]){$ans++ if ($h{$_}==89);}
print "\n2: ", Time::HiRes::tv_interval($time);
On my system (perl 5.16.3 on GNU/Linux) there is no measurable difference. The standard deviation of the timings is larger than the difference between measurements of different versions.
For each variant of the script, 10 executions were performed. The $ARGV[0] was 3.5E5 in all cases (350000).
Without my $num = $ARGV[0]:
$ perl measure.pl
2.369921 2.38991 2.380969 4.419895 2.398861 2.420928 2.388721 2.368144 2.387212 2.386347
mean: 2.5910908
sigma: 0.609763793801797
With my $num = $ARGV[0]:
$ perl measure.pl
4.435764 2.419485 2.403696 2.401771 2.411345 2.466776 4.408127 2.416889 2.389191 2.397409
mean: 2.8150453
sigma: 0.803721101668365
The measure.pl script:
use strict; use warnings; use 5.016;
use List::Util 'sum';

my @times = map qx/perl your-algorithm.pl 3.5E5/, 1..10;
chomp @times;
say "@times";
say "mean: ", mean(@times);
say "sigma: ", sigma(@times);

sub mean { sum(@_)/@_ }

sub sigma {
    my $mean = mean(@_);
    my $variance = sum(map { ($_-$mean)**2 } @_) / @_;
    sqrt $variance;
}
With your-algorithm.pl being reduced so that only one timing is printed:
foreach (1..$ARGV[0]){$ans++ if ($h{$_}==89);}
print Time::HiRes::tv_interval($time), "\n";

Perl Out Of Memory

I have a script that reads two csv files and compares them to find out if an ID that appears in one also appears in the other. The error I am receiving is as follows:
Out of memory during "large" request for 67112960 bytes, total sbrk() is 348203008 bytes
And now for the code:
use strict;
use File::Basename;

my $DAT = $ARGV[0];
my $OPT = $ARGV[1];
my $beg_doc = $ARGV[2];
my $end_doc = $ARGV[3];
my $doc_counter = 0;
my $page_counter = 0;
my %opt_beg_docs;
my %beg_docs;
my ($fname, $dir, $suffix) = fileparse($DAT, qr/\.[^.]*/);
my $outfile = $dir . $fname . "._IMGLOG";

open(OPT, "<$OPT");
while(<OPT>){
    my @OPT_Line = split(/,/, $_);
    $beg_docs{@OPT_Line[0]} = "Y" if(@OPT_Line[3] eq "Y");
    $opt_beg_docs{@OPT_Line[0]} = "Y";
}
close(OPT);

open(OUT, ">$outfile");
while((my $key, my $value) = each %opt_beg_docs){
    print OUT "$key\n";
}
close(OUT);

open(DAT, "<$DAT");
readline(DAT); #skips header line
while(<DAT>){
    $_ =~ s/\xFE//g;
    my @DAT_Line = split(/\x14/, $_);
    #gets the prefix and the range of the beg and end docs
    (my $pre = @DAT_Line[$beg_doc]) =~ s/[0-9]//g;
    (my $beg = @DAT_Line[$beg_doc]) =~ s/\D//g;
    (my $end = @DAT_Line[$end_doc]) =~ s/\D//g;
    #print OUT "BEGDOC: $beg ENDDOC: $end\n";
    foreach($beg .. $end){
        my $doc_id = $pre . $_;
        if($opt_beg_docs{$doc_id} ne "Y"){
            if($beg_docs{$doc_id} ne "Y"){
                print OUT "$doc_id,DOCUMENT NOT FOUND IN OPT FILE\n";
                $doc_counter++;
            } else {
                print OUT "$doc_id,PAGE NOT FOUND IN OPT FILE\n";
                $page_counter++;
            }
        }
    }
}
close(DAT);
close(OUT);
print "Found $page_counter missing pages and $doc_counter missing document(s)";
Basically I get all the IDs from the file I am checking against, to see if each ID exists in it. Then I loop over the other file and generate its IDs, because they are presented as a range. Then I take each generated ID and check for it in the hash of IDs.
Also, I forgot to note that I am using Windows.
You're not using use warnings;, you're not checking for errors on opening files, and you're not printing out debugging statements showing the lines that you are reading in.
Do you know what the input file looks like? If it has no line breaks, you are reading the entire file in all at once, which will be disastrous if it is large. Pay attention to how you are parsing the file.
I'm not sure if it's the cause of your error, but inside your loop where you're reading DAT, you probably want to replace this:
(my $pre = @DAT_Line[$beg_doc]) =~ s/[0-9]//g;
with this:
(my $pre = $DAT_Line[$beg_doc]) =~ s/[0-9]//g;
and same for the other two lines there.
You're closing your OUT file handle and then trying to print to it inside the DAT loop, which, I think, might be outputting to random memory, since you closed the filehandle. I'm surprised this didn't produce an error.
Remove the first close(OUT); and see if that improves things.
I still don't know what your question is. If it's about the error message, it means you've run out of memory. If it's about the message itself: you're trying to consume too much memory. If it's about why you're consuming too much memory, I'd first ask whether you read my message above, then I'd ask how much memory your system has, and then I'd see whether things improve if you take the regex away.
