Compare strings and remove more general pattern in Perl - algorithm

I have an array of strings that represent numbers, possibly separated by forward slashes, e.g. 754 or 754/128. These strings can have arbitrary length; something such as the following is possible: 1234/34/21/120/3. In the array I want to keep only the more specialised patterns, i.e. the ones that contain other patterns. For instance, in the first example above 754/128 contains 754, so 754 can be removed from the array.
This concept of containing is as broad as one would expect, maybe even broader: it is similar to how you look at a directed graph where each slash in the pattern represents a step forward. The contained pattern can be of arbitrary length as long as it occurs inside the containing pattern in some form. This means that the smaller path can occur in any (chronologically correct) form. For instance, 903/900 is contained in 903/902/900 even though the pattern is 'split open'. One way to visualise this: in the small path we go from point A to point B. In the larger path we also go from point A to B, but we make a stop at C. The larger path visits more places than the small path without missing out on anything. As a consequence, the smaller path can occur in any split-up form, as long as the order of the path is respected. As an example:
2/5 - 1/2/3/4/5
# included
5/2 - 1/2/3/4/5
# not included
What I mean here is that the order of the 'contained' items must be preserved in the large path. For instance, 1/3/2 'matches' 1/5/3/4/2 because the order is identical in the small and large path: 1 comes somewhere before 3, which in turn comes somewhere before 2. 1/2/3, 2/1/3 and so on would NOT match the larger path 1/5/3/4/2, even though they are valid paths with the same items, because the order of occurrence is different.
The above example also illustrates that the items in the small pattern can occur anywhere in the large pattern, not only in the first and last position or in subsequent positions. In other words, all included paths for 1/2/3/4 are:
1/2
1/2/3
1/2/4
1/3
1/3/4
1/4
2/3
2/3/4
2/4
3/4
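A quick way to double-check such a list is to enumerate every ordered subsequence of length two or more mechanically. This is only a sketch to illustrate the definition of "contained"; the helper name is made up and not part of the question's code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Enumerate every proper subsequence (length >= 2) of a path, i.e. all
# paths it "contains" under the rules above. Each bitmask selects which
# items to keep; iterating indices in ascending order preserves the order.
sub contained_paths {
    my @items = split m{/}, shift;
    my @out;
    for my $mask (1 .. (1 << @items) - 2) {    # skip the empty and the full set
        my @pick = grep { $mask & (1 << $_) } 0 .. $#items;
        next if @pick < 2;                     # single items are not paths
        push @out, join '/', @items[@pick];
    }
    return sort @out;
}

# Prints the ten contained paths of 1/2/3/4
print "$_\n" for contained_paths("1/2/3/4");
```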
I am looking for an efficient way to remove paths in a given array that are included in others from that same array.
I got this far, but I'm not sure how I should efficiently check the contains relationship between two items.
#!/usr/bin/perl
my @arr = ("903/900", "903/902/900", "903/904/902/901", "903/904/902/908/900", "903");
my @res = ();
OUTER: for (my $i = 0; $i < @arr; $i++) {
    my $first = $arr[$i];
    my $j = $i + 1;
    INNER: while ($j < @arr) {
        my $second = $arr[$j];
        &compare_paths($first, $second);
        $j++;
    }
}
sub compare_paths {
    my ($first, $second) = @_;
    @first_items = split(/\//, $first);
    @second_items = split(/\//, $second);
    # Compare values from 1 and 2
}
The expected output for the code above would be
@res = ("903/904/902/901", "903/904/902/908/900");
Reasons for removal:
903/900 contained in 903/902/900
903/902/900 contained in 903/904/902/908/900
903 contained in 903/904/902/901
How can I implement such an algorithm efficiently? My main idea would be to check whether an item of @first_items is present in $second; if not, continue, but if so, check whether a second item is also present, and if so, check its substring position. This has to be larger than the first item's substring position. Continue for each item (and the other way around with @second_items and $first) until all strings are matched. (If it helps for speed, the initial array can be swapped for a hash with the former array as the keys.)

I expect that there are general algorithms that can account for this problem, and probably libraries that can be leveraged. However, here is a hand-rolled one.
First, we sort the array by the number of terms in the path. Then we go up that array, comparing each element with all longer ones. This way each path is excluded at the earliest opportunity.
The comparison is between arrays obtained by splitting on /. It checks whether all elements of the smaller array are in the larger one as an exact sub-sequence, so that the larger one would yield the smaller one by only removing elements (without rearrangements).
use warnings;
use strict;

my @arr = qw(902/904 903/900 903/902/900 903/904/902/901
             903/904/902/908/900 903);

my @sorted = sort { (split '/', $a) <=> (split '/', $b) } @arr;

my @primes;

OUTER:
for my $i (0..$#sorted) {
    for my $j ($i+1..$#sorted) {
        next OUTER if is_contained($sorted[$i], $sorted[$j]);
    }
    push @primes, $sorted[$i];
}
print "@primes\n";

sub is_contained
{
    my ($small, $large) = @_;
    my @small = split '/', $small;
    my @large = split '/', $large;

    # There can be no duplicates so equal-length paths are distinct
    return 0 if @small == @large;

    # Indices of elements of @small in @large cannot decrease
    my ($match, $index) = (0, 0);
    for my $sm (@small) {
        for my $i (0..$#large) {
            $sm == $large[$i] || next;
            return 0 if $i < $index;   # out of order
            $index = $i;
            $match = 1;
            last;
        }
        return 0 if not $match;        # $sm from @small not in @large
        $match = 0;
    }
    return 1;
}
Prints the line: 902/904 903/904/902/901 903/904/902/908/900.
A note on how we check that @small matches a subsequence in @large. Once an element of @small is found in @large, its index in @large cannot be lower than the one found previously; each element must come after the previous one, not before. See a different procedure below.
So with 2/7/5 and 1/2/5/7/8: first 2 is found at index 1, then 7 at index 3, then 5, but at index 2. The subsequence 2-5-7 does not match 2-7-5. I added 902/904 to the data to test for this.
This is an alternate procedure for checking whether a path is contained in another.
Once it finds an element of @small in @large, it searches for the next one starting from the next index in @large. This way it skips the already-searched part of the path, but it cannot detect out-of-order elements early either.
With the example of 2/7/5 and 1/2/5/7/8, after it finds 7 at index 3 it starts from index 4 and detects the failure by not finding a 5 in the rest of the target path.
sub is_contained_2
{
    my @large = split '/', $_[0];
    my @small = split '/', $_[1];

    # Is @small found in @large as an exact sub-sequence?
    my ($match, $j) = (0, 0);
    for my $sm (@small) {
        for my $i ($j..$#large) {
            $sm == $large[$i] || next;
            $j = $i+1, $match = 1;
            last;
        }
        return 0 if not $match;
        $match = 0;
    }
    return 1;
}
This is slower (by 10-15%) for this data set; see the benchmark below.
I benchmarked the two array-based versions here and ikegami's regex+trie. So far I have used only the specific data set from the question, with 902/904 added.
use warnings;
use strict;
use Benchmark qw(cmpthese);

my $secs_to_run = shift || 10;

my @arr = ('902/904', '903/900', '903/902/900', '903/904/902/901',
           '903/904/902/908/900', '903');

# sorted array checked shorter-to-longer, manual iterations
sub contained {
    my ($rarr) = @_;  my @arr = @$rarr;
    # program copied from this post
    return \@primes;
}
sub is_contained { ... }   # copied

# Same program, but using is_contained_2()
sub contained_2 { ... }
sub is_contained_2 { ... }

# Regex-trie, copied from ikegami's post
sub add { my $p = \shift; $p = \( $$p->{$_} ) for @_, ''; }
sub as_pat { my $trie = shift; ... }   # copied
sub regex_trie {
    my ($rpaths) = @_;  my @paths = @$rpaths;
    # program copied from ikegami's post
    return \@filtered_paths;
}

cmpthese(-$secs_to_run, {
    containted => sub { my $rprimes   = contained(\@arr) },
    cont_next  => sub { my $rprimes   = contained_2(\@arr) },
    regex_trie => sub { my $rfiltered = regex_trie(\@arr) },
});
With bench_cont.pl 300, on a newer workstation-laptop (2.5GHz) with v5.16
              Rate regex_trie  cont_next containted
regex_trie 15264/s         --       -15%       -27%
cont_next  17946/s        18%         --       -14%
containted 20939/s        37%        17%         --
on an older server (2.8GHz) with v5.16
              Rate regex_trie  cont_next containted
regex_trie 11750/s         --       -13%       -27%
cont_next  13537/s        15%         --       -16%
containted 16042/s        37%        19%         --
on an older server (3.5GHz) with v5.10
              Rate  cont_next regex_trie containted
cont_next  12266/s         --       -17%       -17%
regex_trie 14832/s        21%         --        -0%
containted 14845/s        21%         0%         --
This surprised me, as I expected the regex-based solution to be fastest.
I expect the trend to reverse for data composed of longer paths, having more distinct (not contained) paths, with containment found later in the path, and with a few out-of-order dismissals.
I'll add tests once I get to generate such data, or once it is provided.
To track some of the processing, change the loop body to
use feature 'say';

OUTER:
for my $i (0..$#sorted) {
    say "Check $sorted[$i]";
    for my $j ($i+1..$#sorted) {
        my $is_inside = is_contained($sorted[$i], $sorted[$j]);
        say "\t$is_inside: $sorted[$i] vs. $sorted[$j]";
        next OUTER if $is_inside;
    }
    push @primes, $sorted[$i];
}
say "\nNot contained: @primes";
This prints
Check 903
    0: 903 vs. 902/904
    1: 903 vs. 903/900
Check 902/904
    0: 902/904 vs. 903/900
    0: 902/904 vs. 903/902/900
    0: 902/904 vs. 903/904/902/901
    0: 902/904 vs. 903/904/902/908/900
Check 903/900
    1: 903/900 vs. 903/902/900
Check 903/902/900
    0: 903/902/900 vs. 903/904/902/901
    1: 903/902/900 vs. 903/904/902/908/900
Check 903/904/902/901
    0: 903/904/902/901 vs. 903/904/902/908/900
Check 903/904/902/908/900

Not contained: 902/904 903/904/902/901 903/904/902/908/900

To optimize what you have so far, I'd suggest that you pre-split all elements of the array (and then re-merge them later):
@arr = map [split "/", $_], @arr;
With that done:
sub contains(\@\@) {
    my ($larger_ref, $smaller_ref) = @_;
    return '' if @$larger_ref <= @$smaller_ref;
    my ($i, $j) = (0, 0);
    while ($i < @$larger_ref && $j < @$smaller_ref) {
        ++$j if $larger_ref->[$i] == $smaller_ref->[$j];
        ++$i;
    }
    return $j == @$smaller_ref;
}
I: for (my $i = 0; $i < @arr; ++$i) {
    J: for (my $j = 0; $j < @arr; ++$j) {
        next J if $j == $i;
        next I if contains @{$arr[$j]}, @{$arr[$i]};
    }
    push @res, join '/', @{$arr[$i]};
}
There are some potential further optimizations you can make in contains (for example, it might make sense to abort early if/when @$larger_ref - $i < @$smaller_ref - $j), but you'd want to test: they might turn out to be pessimizations.
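For what it's worth, that early-abort idea could be sketched like this; contains_early_abort is a hypothetical variant of the contains sub above, and whether it actually helps is exactly what you would need to benchmark:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# contains() with the suggested early abort: once fewer elements remain
# in the larger path than are still unmatched in the smaller one, no
# match is possible, so we can bail out immediately.
sub contains_early_abort {
    my ($larger_ref, $smaller_ref) = @_;
    return '' if @$larger_ref <= @$smaller_ref;
    my ($i, $j) = (0, 0);
    while ($i < @$larger_ref && $j < @$smaller_ref) {
        # Not enough elements left to match the rest of the smaller path.
        return '' if @$larger_ref - $i < @$smaller_ref - $j;
        ++$j if $larger_ref->[$i] == $smaller_ref->[$j];
        ++$i;
    }
    return $j == @$smaller_ref;
}

print contains_early_abort([903, 902, 900], [903, 900]) ? "yes\n" : "no\n";   # prints "yes"
```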

If a/b/c is a path, you want to remove the paths for which the following is true:
"/$path" =~ m{ ^ (?:/a)?+ (?:/b)?+ (?:/c)?+ \z }x && $path ne 'a/b/c'
That can also be written as
"/$path" =~ m{ ^ (?:/a)?+ (?:/b)?+ (?:/c)?+ \z (?<! ^ /a/b/c ) }x
If both a/b/c, a/i/j, a/x/y and d/e/f are paths, you want to remove the paths for which the following is true:
"/$path" =~ m{
^ (?:/a)?+ (?:/b)?+ (?:/c)?+ \z (?<! ^ /a/b/c )
| ^ (?:/a)?+ (?:/i)?+ (?:/j)?+ \z (?<! ^ /a/i/j )
| ^ (?:/a)?+ (?:/x)?+ (?:/y)?+ \z (?<! ^ /a/x/y )
| ^ (?:/d)?+ (?:/e)?+ (?:/f)?+ \z (?<! ^ /d/e/f )
}x
We removed a lot of the backtracking by using the possessive modifier (+), but there is still a possibility of backtracking because of the common prefixes. So let's remove them!
"/$path" =~ m{
^
(?: (?:/a)?+ (?: (?:/b)?+ (?:/c)?+ \z (?<! ^ /a/b/c )
| (?:/i)?+ (?:/j)?+ \z (?<! ^ /a/i/j )
| (?:/x)?+ (?:/y)?+ \z (?<! ^ /a/x/y )
)
| (?:/d)?+ (?:/e)?+ (?:/f)?+ \z (?<! ^ /d/e/f )
)
}x
Now we have an efficient solution!
The following uses a trie to remove the common prefixes.
use strict;
use warnings;
use feature qw( say );
sub add {
    my $p = \shift;
    $p = \( $$p->{$_} ) for @_, '';
}

sub as_pat {
    my $trie = shift;
    my @sub_pats =
        map { $_ eq '' ? '' : $_ . as_pat($trie->{$_}) }
        keys(%$trie);

    if (@sub_pats == 1) {
        return $sub_pats[0];
    } else {
        return '(?:'.join('|', @sub_pats).')';
    }
}
my @paths = ( "903/900", "903/902/900", "903/904/902/901", "903/904/902/908/900", "903" );

my $trie;
add($trie, map({ "(?:/\Q$_\E)?+" } split qr{/}), "\\z(?<!^/\Q$_\E)")
    for @paths;

my $pat = as_pat($trie);
my $re = qr{^(?:$pat)};

my @filtered_paths = grep { "/$_" !~ /$re/ } @paths;
say for @filtered_paths;
Say N is the number of paths, and O(M) bounds the length of the paths. Like the earlier answers, this one completes in O(N^2 * M^2) time, so it doesn't scale any better. However, I believe you will find mine faster should you benchmark them.

Say N is the number of paths, and O(M) bounds the length of the paths; the earlier answers complete in O(N^2 * M^2) time.
The following solution is O(N * 2^M). This means it can handle very large numbers of paths much more efficiently, as long as the paths are rather short (it is effectively O(N) versus the earlier answers' O(N^2)). It does require far more memory than the solutions in the earlier answers.
use strict;
use warnings;
use feature qw( say );
sub fetch {
    my $trie = shift;
    for (@_, '') {
        return () if !$trie;
        $trie = $trie->{$_};
    }
    return $trie;
}

sub add {
    local *_helper = sub {
        my $trie_ptr = \shift;
        my $exact = shift;
        if (@_) {
            my $lead = shift(@_);
            _helper($$trie_ptr->{$lead}, $exact, @_);
            _helper($$trie_ptr, 0, @_);
        } else {
            if ($exact) {
                $$trie_ptr->{''} ||= 1;
            } else {
                $$trie_ptr->{''} = 2;
            }
        }
    };

    my $trie_ptr = \shift;
    return _helper($$trie_ptr, 1, @_);
}
my @paths = ( "903/900", "903/902/900", "903/904/902/901", "903/904/902/908/900", "903" );

my @split_paths = map [ split qr{/} ], @paths;

my $trie;
add($trie, @$_)
    for @split_paths;

use Data::Dumper qw( Dumper );
local $Data::Dumper::Sortkeys = 1;
print(Dumper($trie));

my @filtered_paths =
    map join('/', @$_),
    grep { fetch($trie, @$_) == 1 }
    @split_paths;
say for @filtered_paths;


Scanning and matching letter sequences using a matrix

I have two txt files which I read into hashes, sequence => title.
In file DUOMENYS.txt the title is known; in file DUOTA.txt the title is unknown.
So for every sequence in DUOTA.txt I need to find a similar sequence in DUOMENYS.txt and then print that known title.
I tried to do this with simple matching, printing the title when more than 90% of the sequence symbols match, but I was told that it is wrong and I have to do it another way, with this table: http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt
I have to compare letters from the known sequence and the unknown sequence and get a number
from -4 to 9 for each pair; if the sum of all those numbers >= length of sequence * 3, print that title.
Example: ABW => TILE1 (DUOMENYS.txt), ABF => UNKNOWNTITLE (DUOTA.txt):
A B W
A B F
4 4 1    sum = 9
length 3 * 3 = 9
9 >= 9, true, print.
So the problem is I don't know how to make it happen....
#!/usr/bin/perl
use strict;
use Data::Dumper;
open (OUTPUT, ">OUTPUT.txt") or die "$!";    # known sequences
open (DUOMENYS, "DUOMENYS.txt") or die "$!";
open (OUTPUT1, ">OUTPUT1.txt") or die "$!";  # sequences to examine
open (DUOTA, "DUOTA.txt") or die "$!";
open (OUTPUT2, ">OUTPUT2.txt") or die "$!";  # results
open (MATRIX, "MATRIX.txt") or die "$!";
#--------------------DUOMENYS-HASH-----------------------------
#my $contentFile = $ARGV[0];
my $contentFile = <DUOMENYS>;
my %testHash = ();
my $currentKey = "";
my $seka = "";
my %nhash = ();
open(my $contentFH, "<", $contentFile);
while (my $contentLine = <DUOMENYS>) {
    chomp($contentLine);
    next if ($contentLine eq "");   # Empty lines.
    if ($contentLine =~ /^\>(.*)/) {
        $testHash{$currentKey} = $seka;
        $currentKey = $1;
        $seka = "";
    } else {
        $seka .= $contentLine;
    }
}
#-------------------DUOTA-HASH-------------------------------------
#my $contentFile1 = $ARGV[0];
my $contentFile1 = <DUOTA>;
my %testHash1 = ();
my $currentKey1 = "";
my $seka1 = "";
my %nhash1 = ();
open(my $contentFH1, "<", $contentFile1);
while (my $contentLine1 = <DUOTA>) {
    chomp($contentLine1);
    next if ($contentLine1 eq "");   # Empty lines.
    if ($contentLine1 =~ /^\>(.*)/) {
        $testHash1{$currentKey1} = $seka1;
        $currentKey1 = $1;
        $seka1 = "";
    } else {
        $seka1 .= $contentLine1;
    }
}
#--------------------OUTPUT-HASH------------------------------------
%nhash = reverse %testHash;
print OUTPUT Dumper(\%nhash);
%nhash1 = reverse %testHash1;
print OUTPUT1 Dumper(\%nhash1);
#---------------------MATCHING---------------------------------------
my $klaidu_skaicius = 0;
my @sekos = keys %nhash;
my @duotos_sekos = keys %nhash1;
my $i = 0;
my $j = 0;
for ($i = 0; $i < scalar @sekos; $i++) {
    for ($j = 0; $j < scalar @duotos_sekos; $j++) {
        $klaidu_skaicius = ($sekos[$i] ^ $duotos_sekos[$j]) =~ tr/\0//c;
        if ($klaidu_skaicius <= length($sekos[$i])/10) {
            print OUTPUT2 substr( $nhash{$sekos[$i]}, 0, 9 ), "\n";
        }
        else {
            print OUTPUT2 "";
        }
    }
}
From the comments:
pastebin.com/7QnBDTDY – povilito May 30 at 11:57
It's too big (15 MB) for pastebin.com – povilito May 30 at 12:01
filedropper.com/duomenys – povilito May 30 at 12:04
I think comparing a "letter" with " " (space) or "" should give us the number -4 – povilito May 30 at 12:28
It's enough to find one title for one unknown sequence. – povilito May 30 at 12:45
So if there are 50 unknown sequences, the output file should give us 50 titles; some titles can be the same :)
Here is a basic version of a solution to your problem, as I understood it. I assumed both sequences are of equal length; you can easily make the changes if you need to.
#!/usr/bin/perl
use strict;
use warnings;

# convert your matrix file to a 2D hash
my %hash;
open my $fh, '<', 'matrix' or die "unable to open file: $!\n";
chomp(my $heading = <$fh>);
# strip out space from the beginning
$heading =~ s/^\s+//g;
my @headings = split (/\s+/, $heading);
while (<$fh>) {
    chomp;
    my @line = split;
    my $key1 = shift @line;
    foreach my $i ( 0 .. $#line ) {
        $hash{$key1}{$headings[$i]} = $line[$i];
    }
}
close $fh;

# Took from your example for reference
my $duameny = "ABW";
my $duota = "ABF";
my @duamenys = split (//, $duameny);
my @duotas = split (//, $duota);

# calculate sum from hash
# considering both sequences are equal
my $sum = 0;
foreach my $i (0 .. $#duamenys) {
    $sum += $hash{$duamenys[$i]}{$duotas[$i]};
}

# calculate length from sequence
my $length = (length $duameny) * 3;

print "SUM: $sum, Length: $length\n";
if ($sum >= $length) {
    # probably you know how to print the title
    # print the title from DUOMENYS.txt
}
Below is a summary of my approach:
First, convert the matrix into a 2D hash for lookup.
Calculate the sum from the hash for the sequences.
Calculate the length from the sequence.
Compare them and print the title if sum >= length.
output:
SUM: 9, Length: 9
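Extending this from one hard-coded pair to whole files means looping over both hashes and scoring each equal-length pair. Here is a sketch of that step; best_title is a made-up helper, and the title => sequence hash and score matrix are assumed to be built as above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first known title whose BLOSUM sum against $seq reaches
# 3 * length, or undef if none qualifies. $known maps title => sequence;
# $matrix is the 2D score hash built from the matrix file.
sub best_title {
    my ($seq, $known, $matrix) = @_;
    for my $title (keys %$known) {
        my $cand = $known->{$title};
        next unless length($cand) == length($seq);   # equal-length only
        my $sum = 0;
        $sum += $matrix->{ substr($cand, $_, 1) }{ substr($seq, $_, 1) }
            for 0 .. length($seq) - 1;
        return $title if $sum >= 3 * length $seq;    # threshold from the question
    }
    return undef;
}

# Tiny demonstration with the ABW/ABF pair from the question:
my %matrix = ( A => { A => 4 }, B => { B => 4 }, W => { F => 1 } );
my %known  = ( TILE1 => "ABW" );
print best_title("ABF", \%known, \%matrix), "\n";    # prints TILE1
```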

How can I make DOT correctly process UTF-8 to PostScript and have multiple graph/pages?

This dot source
graph A
{
    a;
}
graph B
{
    "Enûma Eliš";
}
when compiled with dot -Tps generates this error
Warning: UTF-8 input uses non-Latin1 characters which cannot be handled by this PostScript driver
I can fix the UTF-8 problem by passing -Tps:cairo, but then only graph A is in the output: it is truncated to a single page. The same happens with -Tpdf. There are no other PostScript drivers available in my installation.
I could split the graphs into separate files and concatenate them afterwards, but I'd rather not. Is there a way to have correct UTF-8 handling and multi-page output?
Generating PDF or SVG could bypass the encoding problem too.
dot -Tpdf chs.dot > chs.pdf
// or
dot -Tsvg chs.dot > chs.svg
Apparently the dot PS driver can't handle encodings other than the old ISO 8859-1. I think it can't change fonts either.
One thing you can do is to run a filter to change dot's PostScript output. The following Perl program does that, it's an adaptation of some code I had. It changes the encoding from UTF-8 to a modified ISO encoding with extra characters replacing unused ones.
Of course, the output still depends on the font having the characters. Since dot (I think) only uses the default PostScript fonts, anything beyond the "standard latin" is out of the question...
It works with Ghostscript or with any interpreter which defines AdobeGlyphList.
The filter should be used this way:
dot -Tps graph.dot | perl reenc.pl > output.ps
Here it is:
#!/usr/bin/perl
use strict;
use warnings;
use open qw(:std :utf8);

my $ps = do { local $/; <STDIN> };

my %high;
my %in_use;
foreach my $char (split //, $ps) {
    my $code = (unpack("C", $char))[0];
    if ($code > 127) {
        $high{$char} = $code;
        if ($code < 256) {
            $in_use{$code} = 1;
        }
    }
}

my %repl;
my $i = 128;
foreach my $char (keys %high) {
    if ($in_use{$high{$char}}) {
        $ps =~ s/$char/sprintf("\\%03o", $high{$char})/ge;
        next;
    }
    while ($in_use{$i}) { $i++; }
    $repl{$i} = $high{$char};
    $ps =~ s/$char/sprintf("\\%03o", $i)/ge;
    $i++;
}

my $psprocs = <<"EOPS";
/EncReplacements <<
  @{[ join(" ", %repl) ]}
>> def
/RevList AdobeGlyphList length dict dup begin
  AdobeGlyphList { exch def } forall
end def
% code -- (uniXXXX)
/uniX { 16 6 string cvrs dup length 7 exch sub exch
  (uni0000) 7 string copy dup 4 2 roll putinterval } def
% font code -- glyphname
/unitoname { dup RevList exch known
  { RevList exch get }
  { uniX cvn } ifelse
  exch /CharStrings get 1 index known not
  { pop /.notdef } if
} def
/chg-enc { dup length array copy EncReplacements
  { currentdict exch unitoname 2 index 3 1 roll put } forall
} def
EOPS

$ps =~ s{/Encoding EncodingVector def}{/Encoding EncodingVector chg-enc def};
$ps =~ s/(%%BeginProlog)/$1\n$psprocs/;

print $ps;

Algorithm to do numeric profile of the string

I have a few files similar to the one below, and I am trying to do numeric profiling as mentioned in the image.
>File Sample
attttttttttttttacgatgccgggggatgcggggaaatttccctctctctctcttcttctcgcgcgcg
aaaaaaaaaaaaaaagcgcggcggcgcggasasasasasasaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
I have to map each substring of size 2, then map it to 33 values for different properties, and then add them up over a window of size 5.
my %temp = (
    aCount => {
        aa => 2,
    },
    cCount => {
        aa => 0,
    },
);
My current implementation is as below:
while (<FILE>) {
    my $line = $_;
    chomp $line;
    while ($line =~ /(.{2})/og) {
        $subStr = $1;
        if (exists $temp{aCount}{$subStr}) {
            push @{$temp{aCount_array}}, $temp{aCount}{$subStr};
            if (scalar(@{$temp{aCount_array}}) == $WINDOW_SIZE) {
                my $sum = eval (join('+', @{$temp{aCount_array}}));
                shift @{$temp{aCount_array}};
                # A similar approach has been taken for the other 33 rules
            }
        }
        if (exists $temp{cCount}{$subStr}) {
            # similar approach
        }
        $line =~ s/.{1}//og;
    }
}
Is there any other approach to increase the speed of the overall process?
Regular expressions are awesome, but they can be overkill when all you need are fixed-width substrings. Alternatives are substr
$len = length($line);
for ($i = 0; $i < $len; $i += 2) {
    $subStr = substr($line, $i, 2);
    ...
}
or unpack
foreach $subStr (unpack "(A2)*", $line) {
    ...
}
I don't know how much faster either of these will be than regular expressions, but I know how I would find out.
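Finding out is a matter of a few lines with the core Benchmark module. The sketch below times all three approaches on a made-up line; the ranking on your real data may differ, so swap in your own input:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# A stand-in for one input line; non-overlapping 2-character substrings.
my $line = "acgt" x 50;    # 200 characters

cmpthese(5000, {
    regex => sub {
        my @pairs;
        push @pairs, $1 while $line =~ /(.{2})/g;
    },
    substr => sub {
        my @pairs;
        for (my $i = 0; $i < length($line); $i += 2) {
            push @pairs, substr($line, $i, 2);
        }
    },
    unpack => sub {
        my @pairs = unpack "(A2)*", $line;
    },
});
```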

How to break the infinite loop in Perl under redo (generating random numbers)

I intend to generate random numbers with the following steps:
Read the data from a file (<DATA>)
Generate as many random numbers as there are input data lines
A random number must not be generated twice;
e.g. if the random number generated in loop iteration x has been created
before, recreate the random number.
Here is the code I have, which leads to an infinite loop.
What's wrong with my logic, and how can I fix it?
#!/usr/bin/perl -w
use strict;
my %chrsize = ('chr1' => 249250621);

# For the example's sake I have created the
# repository where a value has been inserted.
my %done = ("chr1\t182881372" => 1);

while ( <DATA> ) {
    chomp;
    next if (/^\#/);
    my ($chr, $pos) = split(/\s+/, $_);

    # this number has been generated before
    # with this: int(rand($chrsize{$chr}));
    # hence have to create one other than this
    my $newst = 182881372;
    my $newpos = $chr . "\t" . $newst;

    # recreate random number
    for (0..10) {
        if ( $done{$newpos} ) {
            # INFINITE LOOP
            $newst = int(rand($chrsize{$chr}));
            redo;
        }
    }
    $done{$newpos} = 1;
    print "$newpos\n";
}
__DATA__
# In reality there are 20M of such lines
# name positions
chr1 157705682
chr1 19492676
chr1 169660680
chr1 226586538
chr1 182881372
chr1 11246753
chr1 69961084
chr1 180227256
chr1 141449512
You had a couple of errors:
You were setting $newst within your loop every time, but $newpos was never recomputed, so it never took on a new value.
Your inner for loop didn't make sense, because you never actually changed $newpos before checking the condition again.
redo was working on the inner loop.
Here is a corrected version that avoids redo altogether.
Update: I edited the algorithm a bit to make it simpler.
#!/usr/bin/perl -w
use strict;
my $chr1size = 249250621;
my %done;
my $newst;

while ( <DATA> ) {
    chomp;
    next if (/^\#/);
    my ($chr, $pos) = split(/\s+/, $_);

    my $newpos;
    # This will always run at least once
    do {
        $newst = int(rand($chr1size));
        $newpos = $chr . "\t" . $newst;
    } while ( $done{$newpos} );

    $done{$newpos} = 1;
    print "$newpos\n";
}
Update 2: while the above algorithm will work, it will get really slow on 20,000,000 lines. Here is an alternative approach that should be faster (there is a sort of pattern to the random numbers it generates, but it would probably be OK for most situations).
#!/usr/bin/perl -w
use strict;
use List::Util qw/shuffle/;

my $newst;
# make sure you have enough. This is good if you have < 100,000,000 lines.
my @rand_pieces = shuffle (0..10000);
my $pos1 = 0;
my $offset = 1;

while ( <DATA> ) {
    chomp;
    next if (/^\#/);
    my ($chr, $pos) = split(/\s+/, $_);

    $newst = $rand_pieces[$pos1] * 10000 + $rand_pieces[($pos1 + $offset) % 10000];
    my $newpos = $chr . "\t" . $newst;
    $pos1++;
    if ($pos1 > $#rand_pieces) {
        $pos1 = 0;
        $offset = ++$offset % 10000;
        if ($offset == 1) { die "Out of random numbers!"; }
    }
    print "$newpos\n";
}
Add a counter to your loop like this:
my $counter = 0;
# recreate
for (0..10) {
    if ( $done{$newpos} ) {
        # INFINITE LOOP
        $newst = int(rand($chr1size));
        redo if ++$counter < 100;   # Safety counter
        # It will continue here once the above doesn't match and the
        # counter eventually runs out
    }
}
To get rid of the infinite loop, replace redo with next.
http://www.tizag.com/perlT/perlwhile.php :
"Redo will execute the same iteration over again."
Then you may need to fix the rest of the logic ;).
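Concretely, "fixing the rest of the logic" means bounding the retries and, crucially, recomputing $newpos on every attempt; otherwise the check never sees a new value. A standalone sketch (the collision key is seeded with a tab to match how $newpos is built):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %chrsize = ( chr1 => 249250621 );
my %done    = ( "chr1\t182881372" => 1 );    # one pre-existing position

my $chr    = 'chr1';
my $newst  = 182881372;
my $newpos = "$chr\t$newst";

# Bounded retry loop: advance normally instead of using redo.
for (0 .. 10) {
    last unless $done{$newpos};              # found an unused position
    $newst  = int(rand($chrsize{$chr}));     # try a fresh random number
    $newpos = "$chr\t$newst";                # recompute, or the check never changes
}
$done{$newpos} = 1;
print "$newpos\n";
```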

Perl to Ruby conversion (multidimensional arrays)

I'm just trying to get my head around multidimensional array creation in a Perl script I'm currently converting to Ruby. I have zero experience with Perl; in fact, I opened my first Perl script this morning.
Here is the original loop:
my $tl = {};
for my $zoom ($zoommin..$zoommax) {
    my $txmin = lon2tilex($lonmin, $zoom);
    my $txmax = lon2tilex($lonmax, $zoom);
    # Note that y=0 is near lat=+85.0511 and y=max is near
    # lat=-85.0511, so lat2tiley is monotonically decreasing.
    my $tymin = lat2tiley($latmax, $zoom);
    my $tymax = lat2tiley($latmin, $zoom);
    my $ntx = $txmax - $txmin + 1;
    my $nty = $tymax - $tymin + 1;
    printf "Schedule %d (%d x %d) tiles for zoom level %d for download ...\n",
        $ntx*$nty, $ntx, $nty, $zoom
        unless $opt{quiet};
    $tl->{$zoom} = [];
    for my $tx ($txmin..$txmax) {
        for my $ty ($tymin..$tymax) {
            push @{$tl->{$zoom}},
                { xyz => [ $tx, $ty, $zoom ] };
        }
    }
}
and what i have so far in Ruby:
tl = []
for zoom in zoommin..zoommax
  txmin = cm.tiles.xtile(lonmin, zoom)
  txmax = cm.tiles.xtile(lonmax, zoom)
  tymin = cm.tiles.ytile(latmax, zoom)
  tymax = cm.tiles.ytile(latmin, zoom)
  ntx = txmax - txmin + 1
  nty = tymax - tymin + 1
  tl[zoom] = []
  for tx in txmin..txmax
    for ty in tymin..tymax
      tl[zoom] << xyz = [tx, ty, zoom]
      puts tl
    end
  end
end
The part I'm unsure of is nested right at the root of the loops: push @{$tl->{$zoom}}, { xyz => [ $tx, $ty, $zoom ] };
I'm sure this will be very simple for a seasoned Perl programmer. Thanks!
The Perl code is building up a complex data structure in $tl: a hash of arrays of hashes of arrays:
$tl->{$zoom}[$i]{xyz}[0] = $tx
$tl->{$zoom}[$i]{xyz}[1] = $ty
$tl->{$zoom}[$i]{xyz}[2] = $zoom
So I think the key line in your Ruby code should be like this:
tl[zoom] << { 'xyz' => [tx, ty, zoom] }
Note also that the root item ($tl) refers to a hash in the Perl code, while your Ruby code initializes it to be an array. That difference might cause problems for you, depending on the values that $zoom takes.
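If it helps to see the target shape, here is a minimal Perl sketch that builds one zoom level of the same structure and dumps it; the 0..1 tile ranges are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Build the same hash -> array -> hash -> array shape for a single
# zoom level, so the structure the Ruby code must mirror is visible.
my $tl = {};
my $zoom = 3;

$tl->{$zoom} = [];
for my $tx (0 .. 1) {
    for my $ty (0 .. 1) {
        push @{ $tl->{$zoom} }, { xyz => [ $tx, $ty, $zoom ] };
    }
}

print Dumper($tl);
```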
