I intend to generate random numbers in the following steps:
Read the data from a file (<DATA>).
Generate as many random numbers as there are input data lines.
A random number should not be generated twice;
e.g. if the random number generated in loop no. x has been created
before, then recreate the random number.
Here is the code I have, which leads to an infinite loop.
What's wrong with my logic, and how can I fix it?
#!/usr/bin/perl -w
use strict;
my %chrsize = ('chr1' => 249250621);
# For the example case I have created the
# repository where a value has been inserted.
my %done = ("chr1\t182881372" => 1);
while ( <DATA> ) {
    chomp;
    next if (/^\#/);
    my ($chr,$pos) = split(/\s+/,$_);
    # this number has been generated before
    # with this: int(rand($chrsize{$chr}));
    # hence have to create other than this one
    my $newst = 182881372;
    my $newpos = $chr ."\t".$newst;
    # recreate random number
    for (0...10){
        if ( $done{$newpos} ) {
            # INFINITE LOOP
            $newst = int(rand($chrsize{$chr}));
            redo;
        }
    }
    $done{$newpos}=1;
    print "$newpos\n";
}
__DATA__
# In reality there are 20M of such lines
# name positions
chr1 157705682
chr1 19492676
chr1 169660680
chr1 226586538
chr1 182881372
chr1 11246753
chr1 69961084
chr1 180227256
chr1 141449512
You had a couple of errors:
You were assigning a new $newst inside the loop, but $newpos was never rebuilt from it, so $newpos never took on a new value.
Your inner for loop didn't make sense, because you never actually changed $newpos before checking the condition again.
redo; was operating on the inner loop.
Here is a corrected version that avoids redo altogether.
Update: I edited the algorithm a bit to make it simpler.
#!/usr/bin/perl -w
use strict;
my $chr1size = 249250621;
my %done;
my $newst;
while ( <DATA> ) {
    chomp;
    next if (/^\#/);
    my ($chr,$pos) = split(/\s+/,$_);
    my $newpos;
    # This will always run at least once
    do {
        $newst = int(rand($chr1size));
        $newpos = $chr ."\t".$newst;
    } while ( $done{$newpos} );
    $done{$newpos}=1;
    print "$newpos\n";
}
Update 2: while the above algorithm will work, it will get really slow on 20,000,000 lines. Here is an alternative approach that should be faster (there is sort of a pattern to the random numbers it generates, but it would probably be OK for most situations).
#!/usr/bin/perl -w
use strict;
my $newst;
# make sure you have enough. This is good if you have < 100,000,000 lines.
use List::Util qw/shuffle/;
my @rand_pieces = shuffle (0..10000);
my $pos1 = 0;
my $offset = 1;
while ( <DATA> ) {
    chomp;
    next if (/^\#/);
    my ($chr,$pos) = split(/\s+/,$_);
    $newst = $rand_pieces[$pos1] * 10000 + $rand_pieces[($pos1+$offset)%10000];
    my $newpos = $chr ."\t".$newst;
    $pos1++;
    if ($pos1 > $#rand_pieces)
    {
        $pos1 = 0;
        $offset = ($offset + 1) % 10000;
        if ($offset == 1) { die "Out of random numbers!"; }
    }
    print "$newpos\n";
}
Add a counter to your loop like this:
my $counter = 0;
# recreate
for (0...10){
    if ( $done{$newpos} ) {
        # INFINITE LOOP
        $newst = int(rand($chrsize{$chr}));
        redo if ++$counter < 100; # Safety counter
        # It will continue here if the above doesn't match, and will
        # run out eventually
    }
}
To get rid of the infinite loop, replace redo with next.
http://www.tizag.com/perlT/perlwhile.php :
"Redo will execute the same iteration over again."
Then you may need to fix the rest of the logic ;).
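For instance, here is a minimal sketch of that inner loop with the flow fixed, assuming the %chrsize and %done hashes from the question: $newpos is rebuilt together with $newst on every attempt, and the retries are bounded so the loop always terminates:
my ($newst, $newpos);
for my $attempt (1 .. 100) {             # bounded retries instead of redo
    $newst  = int(rand($chrsize{$chr}));
    $newpos = $chr . "\t" . $newst;
    last unless $done{$newpos};          # stop at the first unused position
}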
My question is how to add the new position at the end of each line, in Shell or Perl?
I have two files:
File A with 536382 lines and the key is third column:
abc1111,1070X00Y0,**9999**,B
abc2222,1070X00Y0,**9999**,B
abc3333,1070x00Y0,**9999**,B
File B with 946 lines and the key is the first column:
**9999**,Position,West
**9998**,Position,West
**9997**,Position,South
**1111**,Position,South
**9999**,Time,Morning
**9997**,Time,Afternoon
I want a combination of these two files:
abc1111,1070X00Y0,9999,B,West,Morning
abc2222,1070X00Y0,9999,B,West,Morning
abc3333,1070x00Y0,9999,B,West,Morning
I was trying a shell script, but I was getting an out-of-memory error.
So I am open to suggestions.
Thank you, so far.
I was able to get the results you want by making a few changes to your code.
#!/usr/bin/perl
use strict;
use warnings;
open IN2, '<', \<<EOF;
**9999**,Position,West
**9998**,Position,West
**9997**,Position,South
**1111**,Position,South
**9999**,Time,Morning
**9997**,Time,Afternoon
EOF
my %hash;
while ( <IN2> ) {
    chomp;
    my @col2 = split ",";
    $hash{$col2[0]}{$col2[1]} = $col2[2];
}
open IN1, '<', \<<EOF;
abc1111,1070X00Y0,**9999**,B
abc2222,1070X00Y0,**9999**,B
abc3333,1070x00Y0,**9999**,B
EOF
while ( <IN1> ) {
    chomp;
    my $key = (split /,/)[2];
    if ( exists( $hash{$key} ) ) {
        print join(",", $_, @{ $hash{$key} }{ qw/Position Time/ }), "\n";
    }
}
This produced output of:
abc1111,1070X00Y0,**9999**,B,West,Morning
abc2222,1070X00Y0,**9999**,B,West,Morning
abc3333,1070x00Y0,**9999**,B,West,Morning
The changes to the code were:
$hash{$col2[0]}{$col2[1]} = $col2[2]; creates a hash of hashes to hold the Position and Time keys. They are used in a hash slice here:
@{ $hash{$key} }{ qw/Position Time/ }
The overall approach: convert the small file into a Perl hash, then
process the big file line by line, as sketched with real files below.
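Here is a minimal, hedged sketch of the same approach reading from actual files on disk rather than the in-memory handles used above for demonstration; the names fileA.txt (big file) and fileB.txt (small file) are placeholders, not from the question:
#!/usr/bin/perl
use strict;
use warnings;

# Load the small file (946 lines) into a hash of hashes.
my %hash;
open my $small, '<', 'fileB.txt' or die "fileB.txt: $!";
while (<$small>) {
    chomp;
    my ($key, $attr, $value) = split /,/;
    $hash{$key}{$attr} = $value;    # key => { Position => ..., Time => ... }
}
close $small;

# Stream the big file (536382 lines) one line at a time, so the
# whole thing is never held in memory.
open my $big, '<', 'fileA.txt' or die "fileA.txt: $!";
while (<$big>) {
    chomp;
    my $key = (split /,/)[2];
    print join(',', $_, @{ $hash{$key} }{qw/Position Time/}), "\n"
        if exists $hash{$key};
}
close $big;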
I have an array of strings that represent numbers, possibly separated by forward slashes, e.g. 754 or 754/128. These strings can have an undefined length; in other words, something such as the following is possible: 1234/34/21/120/3. In the array I want to only keep the more specialised patterns that contain other patterns. For instance, in the first example above 754/128 contains 754, so 754 can be removed from the array.
This concept of containing is as broad as one would expect, maybe even broader: it is similar to how you look at a directed graph where each slash in the pattern refers to a step forward. The contained pattern can be of arbitrary length as long as it is inside the containing pattern one way or another. This means that the small path can occur in any (chronologically correct) form. For instance, 903/900 is contained in 903/902/900 even though the pattern is 'split open'. One way to visualise this is: in the small path we go from point A to point B. In the larger path we also go from point A to B, but we make a stop at C. The larger path visits more places than the small path without missing out on anything. As a consequence, the smaller path can occur in any split-up form - as long as the order of the path is respected. As an example:
2/5 - 1/2/3/4/5
# included
5/2 - 1/2/3/4/5
# not included
What I mean here is that the position of the 'contained' items should be identical in the large path. For instance: 1/3/2 'matches' in 1/5/3/4/2 because the order is identical in the small and large path: 1 comes in a position before 3, which in turn is in some position before 2. 1/2/3, 2/1/3 and so on would NOT match the larger path 1/5/3/4/2 even though they are valid paths with the same items. This is because the order of occurrence is different.
The above example also illustrates that the items in the small pattern can occur anywhere in the large pattern; not only in the first and last position or in subsequent positions. In other words, all included paths for 1/2/3/4 are:
1/2
1/2/3
1/3
1/4
2/3
2/3/4
2/4
3/4
I am looking for an efficient way to remove paths in a given array that are included in others from that same array.
I got this far, but I'm not sure how I should efficiently check the contains relationship between two items.
#!/usr/bin/perl
my @arr = ("903/900", "903/902/900", "903/904/902/901", "903/904/902/908/900", "903");
my @res = ();
OUTER: for (my $i = 0; $i < @arr; $i++) {
    my $first = $arr[$i];
    my $j = $i+1;
    INNER: while($j < @arr) {
        my $second = $arr[$j];
        &compare_paths($first, $second);
        $j++;
    }
}
sub compare_paths {
    my ($first, $second) = @_;
    my @first_items = split(/\//, $first);
    my @second_items = split(/\//, $second);
    # Compare values from 1 and 2
}
The expected output for the code above would be
@res = ("903/904/902/901", "903/904/902/908/900");
Reasons for removal:
903/900 contained in 903/902/900
903/902/900 contained in 903/904/902/908/900
903 contained in 903/904/902/901
How can I implement such an algorithm efficiently? My main idea would be to check if an item of @first_items is present in $second; if not, continue, but if so, check if a second item is also present, and if so, check its substring position. This has to be larger than the first item's substring position. Continue for each item (and the other way around for @second_items and $first) until all strings are matched. (If it helps for speed, the initial array can be swapped for a hash with the former array as the keys.)
I expect that there are general algorithms that can account for this problem, and probably libraries that can be leveraged. However, here is a hand-rolled one.
First, we sort the array by the number of terms in the path. Then we go up that array, comparing each element with all longer ones. This way each path is excluded at the earliest opportunity.
The comparison is between arrays obtained by splitting on /. It checks whether all elements of the smaller array are in the larger one as an exact sub-sequence, so that the larger one would yield the smaller one by only removing elements (without rearrangements).
use warnings;
use strict;
my @arr = qw(902/904 903/900 903/902/900 903/904/902/901
    903/904/902/908/900 903);
my @sorted = sort { (split '/', $a) <=> (split '/', $b) } @arr;
my @primes;
OUTER:
for my $i (0..$#sorted) {
    for my $j ($i+1..$#sorted) {
        next OUTER if is_contained($sorted[$i], $sorted[$j]);
    }
    push @primes, $sorted[$i];
}
print "@primes\n";
sub is_contained
{
    my ($small, $large) = @_;
    my @small = split '/', $small;
    my @large = split '/', $large;
    # There can be no duplicates so equal-length paths are distinct
    return 0 if @small == @large;
    # Indices of elements of @small in @large cannot decrease
    my ($match, $index) = (0, 0);
    for my $sm (@small) {
        for my $i (0..$#large) {
            $sm == $large[$i] || next;
            return 0 if $i < $index;   # out of order
            $index = $i;
            $match = 1;
            last;
        }
        return 0 if not $match;        # $sm from @small not in @large
        $match = 0;
    }
    return 1;
}
Prints the line: 902/904 903/904/902/901 903/904/902/908/900.
A note on how we check that @small matches a subsequence in @large.
Once a @small element is found in @large, its index in @large
cannot be lower than the one found previously. An element must come after the previous one, not before. See a different procedure below.
So with 2/7/5 and 1/2/5/7/8, first 2 is found at index 1, then 7 at index 3, then 5 but at index 2. The subsequence 2-5-7 does not match 2-7-5. I added 902/904 to the data to test for this.
This is an alternate procedure for checking whether a path is contained in another.
Once it finds an element of @small in @large it searches for the next one starting from the next index in @large. This way it skips the searched part of the path, but it cannot detect out-of-order elements early either.
With the example of 2/7/5 and 1/2/5/7/8, after it found 7 at index 3 it starts from index 4 and detects failure by not finding a 5 in the rest of the target path.
sub is_contained_2
{
    my @large = split '/', $_[0];
    my @small = split '/', $_[1];
    # Is @small found in @large as an exact sub-sequence?
    my ($match, $j) = (0, 0);
    for my $sm (@small) {
        for my $i ($j..$#large) {
            $sm == $large[$i] || next;
            $j = $i+1, $match = 1;
            last;
        }
        return 0 if not $match;
        $match = 0;
    }
    return 1;
}
This is slower (by 10-15%) for this data set, see benchmark with a comment below.
I benchmarked the two array-based versions here and ikegami's regex+trie. So far I have used only the specific data set from the question, with 902/904 added.
use warnings;
use strict;
use Benchmark qw(cmpthese);
my $secs_to_run = shift || 10;
my @arr = ('902/904', '903/900', '903/902/900', '903/904/902/901',
    '903/904/902/908/900', '903');
# sorted array checked shorter-to-longer, manual iterations
sub contained {
    my ($rarr) = @_; my @arr = @$rarr;
    # program copied from this post
    return \@primes;
}
sub is_contained { ... }   # copied
# Same program, but using is_contained_2()
sub contained_2 { ... }
sub is_contained_2 { ... }
# Regex-trie, copied from ikegami's post
sub add { my $p = \shift; $p = \( $$p->{$_} ) for @_, ''; }
sub as_pat { my $trie = shift; ... }   # copied
sub regex_trie {
    my ($rpaths) = @_; my @paths = @$rpaths;
    # program copied from ikegami's post
    return \@filtered_paths;
}
cmpthese(-$secs_to_run, {
    contained  => sub { my $rprimes   = contained(\@arr) },
    cont_next  => sub { my $rprimes   = contained_2(\@arr) },
    regex_trie => sub { my $rfiltered = regex_trie(\@arr) },
});
With bench_cont.pl 300, on a newer workstation-laptop (2.5GHz) with v5.16
               Rate regex_trie  cont_next  contained
regex_trie  15264/s         --       -15%       -27%
cont_next   17946/s        18%         --       -14%
contained   20939/s        37%        17%         --
on an older server (2.8GHz) with v5.16
               Rate regex_trie  cont_next  contained
regex_trie  11750/s         --       -13%       -27%
cont_next   13537/s        15%         --       -16%
contained   16042/s        37%        19%         --
on an older server (3.5GHz) with v5.10
               Rate  cont_next regex_trie  contained
cont_next   12266/s         --       -17%       -17%
regex_trie  14832/s        21%         --        -0%
contained   14845/s        21%         0%         --
This surprised me, as I expected the regex-based solution to be fastest.
I expect the trend to reverse for data composed of longer paths, having more distinct (not contained) paths, with containment found later in the path, and with a few out-of-order dismissals.
I'll add tests once I get to generate such data, or once it is provided.
To track some of the processing, change the body to
use feature 'say';
OUTER:
for my $i (0..$#sorted) {
    say "Check $sorted[$i]";
    for my $j ($i+1..$#sorted) {
        my $is_inside = is_contained($sorted[$i], $sorted[$j]);
        say "\t$is_inside: $sorted[$i] vs. $sorted[$j]";
        next OUTER if $is_inside;
    }
    push @primes, $sorted[$i];
}
say "\nNot contained: @primes";
This prints
Check 903
    0: 903 vs. 902/904
    1: 903 vs. 903/900
Check 902/904
    0: 902/904 vs. 903/900
    0: 902/904 vs. 903/902/900
    0: 902/904 vs. 903/904/902/901
    0: 902/904 vs. 903/904/902/908/900
Check 903/900
    1: 903/900 vs. 903/902/900
Check 903/902/900
    0: 903/902/900 vs. 903/904/902/901
    1: 903/902/900 vs. 903/904/902/908/900
Check 903/904/902/901
    0: 903/904/902/901 vs. 903/904/902/908/900
Check 903/904/902/908/900
Not contained: 902/904 903/904/902/901 903/904/902/908/900
To optimize what you have so far, I'd suggest that you pre-split all elements of the array (and then re-merge them later):
@arr = map [split "/", $_], @arr;
With that done:
sub contains(\@\@) {
    my ($larger_ref, $smaller_ref) = @_;
    return '' if @$larger_ref <= @$smaller_ref;
    my ($i, $j) = (0, 0);
    while ($i < @$larger_ref && $j < @$smaller_ref) {
        ++$j if $larger_ref->[$i] == $smaller_ref->[$j];
        ++$i;
    }
    return $j == @$smaller_ref;
}
I: for (my $i = 0; $i < @arr; ++$i) {
    J: for (my $j = 0; $j < @arr; ++$j) {
        next J if $j == $i;
        next I if contains @{$arr[$j]}, @{$arr[$i]};
    }
    push @res, join '/', @{$arr[$i]};
}
There are some potential further optimizations you can make in contains (for example, it might make sense to abort early if/when @$larger_ref - $i < @$smaller_ref - $j), but you'd want to test: they might turn out to be pessimizations.
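For illustration, here is a sketch of contains with that early-abort check folded in; it only implements the idea from the preceding sentence and is untested for whether it actually helps:
# Sketch of the early-abort variant: give up as soon as the remaining
# part of the larger array is too short to match the rest of the smaller.
sub contains_early_abort {
    my ($larger_ref, $smaller_ref) = @_;
    return '' if @$larger_ref <= @$smaller_ref;
    my ($i, $j) = (0, 0);
    while ($i < @$larger_ref && $j < @$smaller_ref) {
        return '' if @$larger_ref - $i < @$smaller_ref - $j;  # cannot complete
        ++$j if $larger_ref->[$i] == $smaller_ref->[$j];
        ++$i;
    }
    return $j == @$smaller_ref;
}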
If a/b/c is a path, you want to remove the paths for which the following is true:
"/$path" =~ m{ ^ (?:/a)?+ (?:/b)?+ (?:/c)?+ \z }x && $path ne 'a/b/c'
That can also be written as
"/$path" =~ m{ ^ (?:/a)?+ (?:/b)?+ (?:/c)?+ \z (?<! ^ /a/b/c ) }x
If a/b/c, a/i/j, a/x/y and d/e/f are all paths, you want to remove the paths for which the following is true:
"/$path" =~ m{
^ (?:/a)?+ (?:/b)?+ (?:/c)?+ \z (?<! ^ /a/b/c )
| ^ (?:/a)?+ (?:/i)?+ (?:/j)?+ \z (?<! ^ /a/i/j )
| ^ (?:/a)?+ (?:/x)?+ (?:/y)?+ \z (?<! ^ /a/x/y )
| ^ (?:/d)?+ (?:/e)?+ (?:/f)?+ \z (?<! ^ /d/e/f )
}x
We removed a lot of the backtracking by using the possessive modifier (+), but there is still a possibility for backtracking because of the common prefixes. So let's remove them!
"/$path" =~ m{
^
(?: (?:/a)?+ (?: (?:/b)?+ (?:/c)?+ \z (?<! ^ /a/b/c )
| (?:/i)?+ (?:/j)?+ \z (?<! ^ /a/i/j )
| (?:/x)?+ (?:/y)?+ \z (?<! ^ /a/x/y )
)
| (?:/d)?+ (?:/e)?+ (?:/f)?+ \z (?<! ^ /d/e/f )
)
}x
Now we have an efficient solution!
The following uses a trie to remove the common prefixes.
use strict;
use warnings;
use feature qw( say );
sub add {
    my $p = \shift;
    $p = \( $$p->{$_} ) for @_, '';
}
sub as_pat {
    my $trie = shift;
    my @sub_pats =
        map { $_ eq '' ? '' : $_ . as_pat($trie->{$_}) }
        keys(%$trie);
    if (@sub_pats == 1) {
        return $sub_pats[0];
    } else {
        return '(?:'.join('|', @sub_pats).')';
    }
}
my @paths = ( "903/900", "903/902/900", "903/904/902/901", "903/904/902/908/900", "903" );
my $trie;
add($trie, map({ "(?:/\Q$_\E)?+" } split qr{/}), "\\z(?<!^/\Q$_\E)" )
    for @paths;
my $pat = as_pat($trie);
my $re = qr{^(?:$pat)};
my @filtered_paths = grep { "/$_" !~ /$re/ } @paths;
say for @filtered_paths;
Say N is the number of paths, and O(M) bounds the length of the paths. Like the earlier answers, this one completes in O(N^2 * M^2) time, so it doesn't scale any better. However, I believe you will find mine faster should you benchmark them.
Say N is the number of paths, and O(M) bounds the length of the paths; the earlier answers complete in O(N^2 * M^2) time.
The following solution is O(N * 2^M). This means it can handle very large numbers of paths much, much more efficiently, as long as the paths are rather short (because it effectively becomes O(N) vs O(N^2)). It does require far more memory than the solutions in the earlier answers.
use strict;
use warnings;
use feature qw( say );
sub fetch {
    my $trie = shift;
    for (@_, '') {
        return () if !$trie;
        $trie = $trie->{$_};
    }
    return $trie;
}
sub add {
    local *_helper = sub {
        my $trie_ptr = \shift;
        my $exact = shift;
        if (@_) {
            my $lead = shift(@_);
            _helper($$trie_ptr->{$lead}, $exact, @_);
            _helper($$trie_ptr, 0, @_);
        } else {
            if ($exact) {
                $$trie_ptr->{''} ||= 1;
            } else {
                $$trie_ptr->{''} = 2;
            }
        }
    };
    my $trie_ptr = \shift;
    return _helper($$trie_ptr, 1, @_);
}
}
my @paths = ( "903/900", "903/902/900", "903/904/902/901", "903/904/902/908/900", "903" );
my @split_paths = map [ split qr{/} ], @paths;
my $trie;
add($trie, @$_)
    for @split_paths;
use Data::Dumper qw( Dumper );
local $Data::Dumper::Sortkeys = 1;
print(Dumper($trie));
my @filtered_paths =
    map join('/', @$_),
    grep { fetch($trie, @$_) == 1 }
    @split_paths;
say for @filtered_paths;
I have two txt files, which I put into hashes (sequence => title).
In the file DUOMENYS.txt the title is known; in the file DUOTA.txt the title is unknown.
So for every sequence in the file DUOTA.txt I need to find a similar sequence in the file DUOMENYS.txt and then print that known title.
I tried to do this with simple matching, printing the title when more than 90% of the sequence symbols match, but I was told that it is wrong and I have to do it another way, with this table: http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt
I have to compare letters from the known sequence and the unknown sequence and get a number
between -4 and 9 for each pair; if the sum of all those numbers >= length of sequence * 3, print that title.
Example: ABW => TITLE1 (DUOMENYS.txt), ABF => UNKNOWNTITLE (DUOTA.txt):
A B W
A B F
4 4 1   sum = 9
length 3 x 3 = 9
9 >= 9, true, print.
So the problem is I don't know how to make it happen....
#!/usr/bin/perl
use strict;
use Data::Dumper;
open (OUTPUT, ">OUTPUT.txt") or die "$!";    # the known sequences we have
open (DUOMENYS, "DUOMENYS.txt") or die "$!";
open (OUTPUT1, ">OUTPUT1.txt") or die "$!";  # the sequences under study
open (DUOTA, "DUOTA.txt") or die "$!";
open (OUTPUT2, ">OUTPUT2.txt") or die "$!";  # results
open (MATRIX, "MATRIX.txt") or die "$!";
#--------------------DUOMENYS-HASH-----------------------------
#my $contentFile = $ARGV[0];
my $contentFile = <DUOMENYS>;
my %testHash = ();
my $currentKey = "";
my $seka = "";
my %nhash = ();
open(my $contentFH,"<",$contentFile);
while(my $contentLine = <DUOMENYS>){
    chomp($contentLine);
    next if($contentLine eq "");   # Empty lines.
    if($contentLine =~ /^\>(.*)/){
        $testHash{$currentKey} = $seka;
        $currentKey = $1;
        $seka = "";
    }else{
        $seka .= $contentLine;
    }
}
#-------------------DUOTA-HASH-------------------------------------
#my $contentFile1 = $ARGV[0];
my $contentFile1 = <DUOTA>;
my %testHash1 = ();
my $currentKey1 = "";
my $seka1 = "";
my %nhash1 = ();
open(my $contentFH1,"<",$contentFile1);
while(my $contentLine1 = <DUOTA>){
    chomp($contentLine1);
    next if($contentLine1 eq "");   # Empty lines.
    if($contentLine1 =~ /^\>(.*)/){
        $testHash1{$currentKey1} = $seka1;
        $currentKey1 = $1;
        $seka1 = "";
    }else{
        $seka1 .= $contentLine1;
    }
}
#--------------------OUTPUT-HASH------------------------------------
%nhash = reverse %testHash;
print OUTPUT Dumper(\%nhash);
%nhash1 = reverse %testHash1;
print OUTPUT1 Dumper(\%nhash1);
#---------------------MATCHING---------------------------------------
my $klaidu_skaicius = 0;   # Lithuanian: "error count"
my @sekos = keys %nhash;
my @duotos_sekos = keys %nhash1;
my $i = 0;
my $j = 0;
for($i = 0; $i <= scalar @sekos; $i++){
    for($j = 0; $j <= scalar @duotos_sekos; $j++){
        $klaidu_skaicius = (@sekos[$i] ^ @duotos_sekos[$j]) =~ tr/\0//c;
        if($klaidu_skaicius <= length(@sekos[$i])/10){
            print OUTPUT2 substr( $nhash{@sekos[$i]}, 0, 9 ), "\n";
        }
        else{
            print OUTPUT2 "";
        }
    }
}
From the comments:
pastebin.com/7QnBDTDY
filedropper.com/duomenys
I think comparing a "letter" with " " (space) or "" should give us the number -4.
It's enough to find one title for one unknown sequence.
So if there are 50 unknown sequences, the output file should give us 50 titles; some titles can be the same :)
Here is a basic version of a solution to your problem, as I understood it. I assumed both sequences are of equal length; you can make the changes easily if you need.
#!/usr/bin/perl
use strict;
use warnings;
# convert your matrix file to a 2D hash
my %hash;
open my $fh, '<', 'matrix' or die "unable to open file: $!\n";
chomp(my $heading = <$fh>);
# strip out space from the beginning
$heading =~ s/^\s+//g;
my @headings = split (/\s+/, $heading);
while(<$fh>) {
    chomp;
    my @line = split;
    my $key1 = shift @line;
    foreach my $i ( 0 .. $#line ) {
        $hash{$key1}{$headings[$i]} = $line[$i];
    }
}
close $fh;
# Took from your example for reference
my $duameny = "ABW";
my $duota = "ABF";
my @duamenys = split (//, $duameny);
my @duotas = split (//, $duota);
# calculate sum from hash
# considering both sequences are equal
my $sum = 0;
foreach my $i (0 .. $#duamenys) {
    $sum += $hash{$duamenys[$i]}{$duotas[$i]};
}
# calculate length from sequence
my $length = (length $duameny) * 3;
print "SUM: $sum, Length: $length\n";
if($sum >= $length) {
    # probably you know how to print the title
    # print the title from duamenys.txt
}
Below is a summary of my approach:
First, convert the matrix into a 2D hash for lookups.
Calculate the sum from the hash for the sequences.
Calculate the length from the sequence.
Compare them and print the title if sum >= length.
output:
SUM: 9, Length: 9
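As for the title-printing step left as a comment in the code above, here is a hedged sketch. It assumes the question's %testHash (title => sequence, built from DUOMENYS.txt), the %hash matrix lookup from this answer, and equal-length sequences:
# Sketch: scan the known sequences for one whose score against the
# unknown sequence clears the length * 3 threshold, then print its title.
for my $title (keys %testHash) {
    my $known = $testHash{$title};
    next unless length($known) == length($duota);   # equal lengths assumed
    my $score = 0;
    $score += $hash{ substr($known, $_, 1) }{ substr($duota, $_, 1) }
        for 0 .. length($duota) - 1;
    if ($score >= 3 * length($duota)) {
        print "$title\n";
        last;   # one title per unknown sequence is enough
    }
}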
[Strawberry Perl v5.16.3, Windows 7 x64, executing via cmd, eg c:\strawberry> perl test.pl 100000]
SYMPTOM: The following code: foreach (1..$ARGV[0]) { foo($_); } executes roughly 20% slower than if I had included this extra line before it: my $num = $ARGV[0];
QUESTION: Can anyone help me understand why?
Notice, in the second case, that after I initialize and set $num, I do not then use $num in the loop parameters. Were this the case, I could probably be convinced that repeatedly testing against $ARGV[0] in a for loop is somehow slower than a variable that I define myself... but this is not the case.
To track time, I use: use Time::HiRes; my $time = [Time::HiRes::gettimeofday()]; at the top of my script, and: print "\n1: ", Time::HiRes::tv_interval($time); at the bottom.
Confused!
Thanks,
Michael
EDIT
I am including the entire script, with a comment preceding the offending line... Interestingly, it looks like the time discrepancy is at least partially dependent on my redundant initialization of %h, as well as @chain... This is getting weird.
use Time::HiRes; my $time = [Time::HiRes::gettimeofday()];
#my $max=$ARGV[0];
my %h = (1=>1,89=>89);
$h{1}=1;
$h{89}=89;
my @chain=();
my $ans=0;
sub sum{my $o=0; foreach (@_){$o+=$_}; return $o;}
foreach (1..$ARGV[0]-1){
    my $x=$_;
    my @chain = ();
    while(!exists($h{$x})){
        push(@chain,$x);
        $x = sum(map {$_**2} split('',$x));
    }
    foreach (@chain){$h{$_}=$h{$x} if !exists($h{$_});}
}
print "\n1: ", Time::HiRes::tv_interval($time);
foreach (1..$ARGV[0]){$ans++ if ($h{$_}==89);}
print "\n2: ", Time::HiRes::tv_interval($time);
On my system (perl 5.16.3 on GNU/Linux) there is no measurable difference. The standard deviation of the timings is larger than the difference between measurements of different versions.
For each variant of the script, 10 executions were performed. The $ARGV[0] was 3.5E5 in all cases (350000).
Without my $num = $ARGV[0]:
$ perl measure.pl
2.369921 2.38991 2.380969 4.419895 2.398861 2.420928 2.388721 2.368144 2.387212 2.386347
mean: 2.5910908
sigma: 0.609763793801797
With my $num = $ARGV[0]:
$ perl measure.pl
4.435764 2.419485 2.403696 2.401771 2.411345 2.466776 4.408127 2.416889 2.389191 2.397409
mean: 2.8150453
sigma: 0.803721101668365
The measure.pl script:
use strict; use warnings; use 5.016;
use List::Util 'sum';
my @times = map qx/perl your-algorithm.pl 3.5E5/, 1..10;
chomp @times;
say "@times";
say "mean: ", mean(@times);
say "sigma: ", sigma(@times);
sub mean { sum(@_)/@_ }
sub sigma {
    my $mean = mean(@_);
    my $variance = sum(map { ($_-$mean)**2 } @_) / @_;
    sqrt $variance;
}
With your-algorithm.pl being reduced so that only one timing is printed:
foreach (1..$ARGV[0]){$ans++ if ($h{$_}==89);}
print Time::HiRes::tv_interval($time), "\n";
I'm having trouble calculating an average of specific numbers in a column BELOW a specific text identifier using awk. I have two columns of data, and I'm trying to start the average keying on a common identifier that repeats, which is 01/1991. So awk should calculate the average of all lines beginning with 01/1991 (which repeats), using that line and the next 21 lines, for a total of 22 rows in the average, one per year 1991-2012. The desired output is an average for each TextID/Name entry over all the Januaries (01) of the years 1991-2012, shown below:
TextID/Name 1
Avg: 50.34
TextID/Name 2
Avg: 45.67
TextID/Name 3
Avg: 39.97
...
sample data:
TextID/Name 1
01/1991, 57.67
01/1992, 56.43
01/1993, 49.41
..
01/2012, 39.88
TextID/Name 2
01/1991, 45.66
01/1992, 34.77
01/1993, 56.21
..
01/2012, 42.11
TextID/Name 3
01/1991, 32.22
01/1992, 23.71
01/1993, 29.55
..
01/2012, 35.10
continues with the same data for TextID/Name 4
I'm getting an answer using the code shown below, but the average starts calculating BEFORE the specific identifier line, not on and below that line (01/1991).
awk '$1="01/1991" {sum+=$2} (NR%22==0){avg=sum/22;print"Average: "avg;sum=0;next}' myfile
Thanks, and explanations of the solution are greatly appreciated! I have edited the original question with more description - thank you again.
If you look at your file, the first field is "01/1991," with a comma at the end, not "01/1991". Also, NR%22==0 will look at line numbers divisible by 22, not 22 lines after the point it thinks you care about.
You can do something like this instead:
awk '
BEGIN { l=-1; }
$1 == "01/1991," {
    l=22;
    s=0;
}
l > 0 { s+=$2; l--; }
l == 0 { print s/22; l--; }'
It has a counter l that it sets to the number of lines to count, then it sums up that number of lines.
You may want to consider simply summing all lines from one 01/1991 to the next though, which might be more robust.
If you're allowed to use Perl instead of Awk, you could do:
#!/usr/bin/env perl
$start = 0;
$have_started = 0;
$count = 0;
$sum = 0;
while (<>) {
    $line = $_;
    # Grab the value after the date and comma
    if ($line =~ /\d+\/\d+,\s+([\d\.]+)/) {
        $val = $+;
    }
    # Start summing values after 01/1991
    if (/01\/1991,\s+([\d\.]+)/) {
        $have_started = 1;
        $val = $+;
    }
    # If we have started counting,
    if ($have_started) {
        $count++;
        $sum += $+;
    }
}
print "Average of all values = " . $sum/$count;
Run it like so:
$ cat your-text-file.txt | above-perl-script.pl