Perl: how to combine consecutive page numbers? - windows

OS: Windows Server 2012, so I don't have access to Unix utilities.
ActiveState Perl 5.16. Sorry, I cannot upgrade the OS or Perl; I'm stuck with them.
I did a Google search and read about 10 of the resulting pages. I found similar problems, but not what I'm looking for.
I then did 3 searches here and found similar issues with SQL, R, and XSLT, but not what I'm looking for.
I'm not sure where to start, so I don't even have code yet.
I'd like to combine consecutive page numbers into a page range. Input will be a series of numbers in an array.
Input as an array of numbers: my @a=(1,2,5)
Output as a string: 1-2, 5
Input ex: (1,2,3,5,7)
Output ex: 1-3, 5, 7
Input ex: (100,101,102,103,115,120,121)
Output ex: 100-103,115,120-121
Thank you for your help!
This is the only code I have so far.
sub procpages_old
# $aref = array ref to list of page numbers.
# $model = used for debugging.
# $zpos = used for debugging only.
{
    my($aref,$model,$zpos)=@_;
    my $procname=(caller(0))[3];
    my @arr=@$aref; # Array of page numbers.
    my @newarr=();
    my $i=0;
    my $np1=0; # Page 1 of possible range.
    my $np2=0; # Page 2 of possible range.
    my $p1=0; # Page number to test.
    my $p2=0;
    my $newpos=0;
    while ($i<$#arr)
    {
        $np1=$arr[$i];
        $np2=getdata($arr[$i+1],'');
        $p1=$np1;
        $p2=$np2;
        while ($p2==($p1+1)) # Consecutive page numbers?
        {
            $i++;
            $p1=$a[$i];
            $p2=getdata($a[$i+1],'');
        }
        $newarr[$newpos]=$np1.'-'.$p2;
        $newpos++;
        # End of loop
        $i++;
    }
    my $pages=join(', ',@arr);
    return $pages;
}

That's called an intspan. Use Set::IntSpan::Fast::XS.
use Set::IntSpan::Fast::XS qw();
my $s = Set::IntSpan::Fast::XS->new;
$s->add(100,101,102,103,115,120,121);
$s->as_string; # 100-103,115,120-121

This seems to do what you want.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
    chomp;
    say rangify(split /,/);
}
sub rangify {
    my @nums = @_;
    my @range;
    for (0 .. $#nums) {
        if ($_ == 0 or $nums[$_] != $nums[$_ - 1] + 1) {
            push @range, [ $nums[$_] ];
        } else {
            push @{$range[-1]}, $nums[$_];
        }
    }
    for (@range) {
        if (@$_ == 1) {
            $_ = $_->[0];
        } else {
            $_ = "$_->[0]-$_->[-1]";
        }
    }
    return join ',', @range;
}
__DATA__
1,2,5
1,2,3,5,7
100,101,102,103,115,120,121
The rangify() function builds an array of arrays. It traverses your input list and if a number is just one more than the previous number then it adds the new number to the second-level array that's currently at the end of the first-level array. If the new number is not sequential, it adds a new second-level array at the end of the first level array.
Having built this data structure, we walk the first-level array, looking at each of the second-level arrays. If the second level array contains only one element then we know it's not a range, so we overwrite the value with the single number from the array. If it contains more than one element, then it's a range and we overwrite the value with the first and last elements separated with a hyphen.
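Note that rangify() as written expects its input to be sorted already and free of duplicates. If that cannot be guaranteed, something along the following lines might help. It is only a sketch: the name page_ranges is invented here, each run is stored as a [first, last] pair instead of a full sub-array, and the ", " separator is chosen to match the first examples in the question.

```perl
use strict;
use warnings;

# Hypothetical wrapper: sort numerically, drop duplicates, then collapse runs.
sub page_ranges {
    my %seen;
    my @nums = sort { $a <=> $b } grep { !$seen{$_}++ } @_;
    my @runs;                                   # each run is [first, last]
    for my $n (@nums) {
        if (@runs && $n == $runs[-1][1] + 1) {
            $runs[-1][1] = $n;                  # extend the current run
        } else {
            push @runs, [ $n, $n ];             # start a new run
        }
    }
    # A run whose first and last page are equal is a single page.
    return join ', ',
        map { $_->[0] == $_->[1] ? $_->[0] : "$_->[0]-$_->[1]" } @runs;
}

print page_ranges(7, 3, 1, 2, 2, 5), "\n";      # prints "1-3, 5, 7"
```

Keeping only the two endpoints of each run avoids storing every page number a second time, which matters little for page lists but keeps the formatting step trivial.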

So I managed to adjust this code to work for me. Pass a reference to your array of numbers to procpages(), which then calls num2range().
######################################################################
# In:
# Out:
sub num2range
{
    local $_ = join ',' => @_;
    s/(?<!\d)(\d+)(?:,((??{$++1}))(?!\d))+/$1-$+/g;
    tr/-,/, /;
    return $_;
}
######################################################################
# Concatenate consecutive page numbers in array.
# In: array like (1,2,5,7,99,100,101)
# Out: string like "1-2, 5, 7, 99-101"
sub procpages
{
    my($aref,$model,$zpos)=@_;
    my $procname=(caller(0))[3];
    my @arr=@$aref;
    my $pages=num2range(@arr);
    $pages=~s/\,/\-/g; # Change comma to dash.
    $pages=~s/ /\, /g; # Change space to comma and space.
    #$pages=~s/\,/\, /g;
    return $pages;
}

You probably have the best solution already with the Set::IntSpan::Fast::XS module, but assuming you want to take the opportunity to learn Perl, here's another Perl-ish way to do it.
use strict;
use warnings;
my @nums = (1,2,5);
my $prev = -999; # assuming you only use positive values, this will work
my @out = ();
for my $num (@nums) {
    # if we are continuing a sequence, add a hyphen unless we did last time
    if ($num == $prev + 1) {
        push (@out, '-') unless (@out and $out[-1] eq '-');
    }
    else {
        # if we are breaking a sequence (@out ends in '-'), add the previous number first
        if (@out and $out[-1] eq '-') {
            push(@out, $prev);
        }
        # then add the current number
        push (@out, $num);
    }
    # track the previous number
    $prev = $num;
}
# add the final number if necessary to close the sequence
push(@out, $prev) if (@out and $out[-1] eq '-');
# join all values with comma
my $pages = join(',', @out);
# flatten the ',-,' sequence to a single '-'
$pages =~ s/,-,/-/g;
print "$pages\n";
This is not super elegant or short, but it is very simple to understand and debug.

Insert multiple characters in string at once

Whereas str[] replaces a character, str.insert inserts a character at a position. But inserting two characters requires two lines of code:
str = "COSO17123456"
str.insert 4, "-"
str.insert 7, "-"
=> "COSO-17-123456"
I was thinking how to do this in one line of code. I came up with the following solution:
str = "COSO17123456"
str.each_char.with_index.reduce("") { |acc,(c,i)| acc += c + ( (i == 3 || i == 5) ? "-" : "" ) }
=> "COSO-17-123456"
Is there a built-in Ruby helper for this task? If not, should I stick with the insert option rather than combining several iterators?
Use each to iterate over an array of indices:
str = "COSO17123456"
[4, 7].each { |i| str.insert i, '-' }
str #=> "COSO-17-123456"
You can use slices and .join:
> [str[0..3], str[4..5],str[6..-1]].join("-")
=> "COSO-17-123456"
Note that the indices after the first one differ from those used with insert, because here no earlier insertion has shifted the string.
You refer to absolute positions in the original string (more natural, to me anyway), not to positions that move as insertions are made.
If you want to insert at specific absolute index values, you can also use .each_with_index and control the behavior character by character:
str2 = ""
tgts=[3,5]
str.split("").each_with_index { |c,idx| str2+=c; str2+='-' if tgts.include? idx }
Both of the above create a new string.
String#insert returns the string itself.
This means you can chain the method calls, which can be prettier and more efficient if you only have to do it a couple of times, as in your example:
str = "COSO17123456".insert(4, "-").insert(7, "-")
puts str
COSO-17-123456
Your reduce version can therefore be written more concisely as:
[4,7].reduce(str) { |str, idx| str.insert(idx, '-') }
I'll bring one more variation to the table, String#unpack:
new_str = str.unpack("A4A2A*").join('-')
# or with String#%
new_str = "%s-%s-%s" % str.unpack("A4A2A*")

Deleting lines with more than 30% lowercase letters

I'm trying to process some data, but I'm unable to find a working solution for my problem. I have a file which looks like:
>ram
cacacacacacacacacatatacacatacacatacacacacacacacacacacacacaca
cacacacacacacaca
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
>sam
AATTGGCCAATTGGCAATTCCGGAATTCaattggccaattccggaattccaattccgg
and many lines more....
I want to filter out every sequence, together with its header (headers start with >), whose sequence string (the lines not starting with >) consists of 30 percent or more lowercase letters. A sequence string can span multiple lines.
So after command xy the output should look like:
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
I tried some mix of a while loop for reading the input file and then working with awk, grep, sed but there was no good outcome.
Here's one idea, which sets the record separator to ">" to treat each header with its sequence lines as a single record.
Because the input starts with a ">", which causes an initial empty record, we guard the computation with NR > 1 (record number greater than one).
To count the number of characters we add the lengths of all the lines after the header. To count the number of lower-case characters, we save the string in another variable and use gsub to replace all the lower-case letters with nothing --- just because gsub returns the number of substitutions made, which is a convenient way of counting them.
Finally we check the ratio and print or not (adding back the initial ">" when we do print).
BEGIN { RS = ">" }
NR > 1 {
    total_cnt = 0
    lower_cnt = 0
    for (i=2; i<=NF; ++i) {
        total_cnt += length($i)
        s = $i
        lower_cnt += gsub(/[a-z]/, "", s)
    }
    ratio = lower_cnt / total_cnt
    if (ratio < 0.3) print ">"$0
}
$ awk -f seq.awk seq.txt
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
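The same record-oriented idea can be sketched in Perl by splitting the input on ">" and counting lowercase letters with tr. The helper name filter_records and its $max_ratio parameter are made up for illustration, and the sketch assumes the whole input fits in memory as one string and that headers never contain ">".

```perl
use strict;
use warnings;

# Hypothetical helper: keep the records whose sequence is less than
# $max_ratio lowercase. $text is the whole FASTA-like input as one string.
sub filter_records {
    my ($text, $max_ratio) = @_;
    my @kept;
    for my $record (split />/, $text) {       # one chunk per header + sequence
        my ($header, @lines) = split /\n/, $record;
        my $seq = join '', @lines;            # a sequence may span several lines
        next unless length $seq;              # also skips the empty chunk before the first ">"
        my $lower = ($seq =~ tr/a-z//);       # tr without a replacement list only counts
        push @kept, ">$record" if $lower / length($seq) < $max_ratio;
    }
    return @kept;
}

my $text = ">ram\ncacaca\nca\n>pam\nGAATGTCAAAct\n";
print filter_records($text, 0.3);             # prints only the ">pam" record
```

Joining the sequence lines before counting keeps newlines out of the ratio, which the record-separator approach in awk achieves by summing the field lengths.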
Or, with GNU awk (gensub and RT are gawk extensions):
awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file
RS='>[a-z]+\n' - Sets the record separator to the line containing '>' and name
RT - This value is set by what is matched by RS above
a=RT - save previous RT value
n=length(gensub(/[A-Z]/,"","g")); - get the length of lower case chars
if(NF && n/length*100 < 30)print a $0; - check we have a value and that the percentage is less than 30 for lower case chars
awk '/^>/{b=B;gsub( /[A-Z]/,"",b);
    if( length( b) < length( B) * 0.3) print H "\n" B
    H=$0;B="";next}
  {B=( (B != "") ? B "\n" : "" ) $0}
  END{ b=B;gsub( /[A-Z]/,"",b);
    if( length( b) < length( B) * 0.3) print H "\n" B
  }' YourFile
Quick and dirty; a function would suit the repeated test-and-print step better.
Nowadays I would not use sed or awk anymore for anything longer than 2 lines.
#! /usr/bin/perl
use strict;   # Force variable declaration.
use warnings; # Warn about dangerous language use.
sub filter    # Declare a subroutine, a function called `filter`.
{
    my ($header, $body) = @_;      # Give the first two function arguments the names header and body.
    my $lower = $body =~ tr/a-z//; # Count the characters a-z (tr with an empty replacement list only counts).
    print $header, $body, "\n"     # Print header, body and newline,
        unless $lower / length ($body) > 0.3; # unless more than 30% of the characters are lowercase.
}
my ($header, $body); # Declare two variables for header and body.
while (<>) { # Loop over all lines from stdin or a file given in the command line.
    if (/^>/) { # If the line starts with >,
        filter ($header, $body) # call filter with header and body,
            if defined $header; # if header is defined, which is not the case at the beginning of the file.
        ($header, $body) = ($_, ''); # Assign the current line to header and an empty string to body.
    } else {
        chomp; # Remove the newline at the end of the line.
        $body .= $_; # Append the line to body.
    }
}
filter ($header, $body); # Filter the last record.

How can I sort an array or table by column in Perl?

I've been looking everywhere for an answer to this, and I just can't get it to work.
I have an input file that is read into an array using Perl. The file is a text file containing a table. Perl reads it in as an array, with each element being a full line (including all five columns). This is what the array looks like:
0__len__340 16 324 0 0.0470588235294118
1__len__251 2 249 0 0.00796812749003984
2__len__497 0 497 0 0
3__len__55 7 48 0 0.127272727272727
4__len__171 0 171 0 0
5__len__75 0 75 0 0
6__len__160 75 85 0 0.46875
7__len__285 1 284 0 0.00350877192982456
8__len__94 44 50 0 0.468085106382979
I need to sort this table by the last column in descending order. So my output should be:
6__len__160 75 85 0 0.46875
8__len__94 44 50 0 0.468085106382979
3__len__55 7 48 0 0.127272727272727
0__len__340 16 324 0 0.0470588235294118
1__len__251 2 249 0 0.00796812749003984
7__len__285 1 284 0 0.00350877192982456
2__len__497 0 497 0 0
4__len__171 0 171 0 0
5__len__75 0 75 0 0
I've tried a few approaches, but none have worked. Here's the code I've tried:
@input = <FILENAME>;
#Close the file
close FILENAME;
my @fractions;
my $y = 0;
for (my $x = 1; $x <= $#input; ++$x) {
    $fractions[$y] = (split (/\s/, $input[$x]))[4];
    ++$y;
}
my @sorted = sort {$b <=> $a} @fractions;
my $e = 1;
my $z = 0;
my $f = 0;
my @final;
do {
    do {
        if ((split (/\s/, $input[$e]))[4] == $sorted[$z]){
            $final[$f] = $input[$e];
            ++$e;
            ++$f;
        } else {
            ++$e;
        }
    } until ($e > $#input);
    do {
        ++$z;
    } until ($sorted[$z] != $sorted[$z - 1]);
    $e = 0;
} until ($z > $#sorted);
for (my $h = 0; $h <= $#final; ++$h) {
    print $final[$h] . "\n\n";
}
With this one, I basically tried to put the column 5 numbers into their own array, sort them, and then go back through the original array and pull out the elements that match the sorted array, and put them into the final array.
This may work if I keep working on it, but it takes so long to run that it's impractical. This small table I'm using to test my code with took a long time for this to run, and once the code is working it will be dealing with a table that has millions of rows.
I also tried applying the sort command to the table itself, but my output is the exact same table as my input...it doesn't get sorted.
@input = <FILENAME>;
close FILENAME;
my @sorted = sort { $b->[4] <=> $a->[4] } @input;
for (my $h = 0; $h <= $#sorted; ++$h) {
    print $sorted[$h] . "\n\n";
}
exit;
Lastly, I tried to put the array into a hash where the key was the first four columns, since the first column name is unique, and the values being the fifth column.
Then I hoped I could sort the hash by the values and the keys would stay with their assigned values. I couldn't get this to work either, though unfortunately it was a couple days ago and I erased the code.
One problem was that I couldn't figure out how to split the string only before the fifth column, so I end up with two strings, one containing the first four columns and one containing the fifth.
What am I doing wrong with the sort command? Is there a better way to do this?
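In your last code example, each element of @input is a whole line (a string), not an array reference, so $a->[4] never sees the fifth column. Split the line inside the comparison instead. Replace
my @sorted = sort { $b->[4] <=> $a->[4] } @input;
with
my @sorted = sort { (split(' ', $b))[4] <=> (split(' ', $a))[4] } @input;
or even
my @sorted = sort { (split(/\s+/, $b))[4] <=> (split(/\s+/, $a))[4] } @input;
if input data has no lines with leading spaces.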
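The question also mentions not knowing how to split a line just before its last column. One possible way, shown here only as a sketch, is a single regex with a greedy first capture. The function name split_last_column is invented for this example, and it assumes columns are separated by whitespace.

```perl
use strict;
use warnings;

# Split a line into "everything before the last column" and "the last column".
# Returns an empty list if the line has fewer than two columns.
sub split_last_column {
    my ($line) = @_;
    # The greedy (.*\S) grabs as much as it can, leaving only the final
    # whitespace-separated field for (\S+).
    return $line =~ /^\s*(.*\S)\s+(\S+)\s*$/;
}

my ($head, $last) = split_last_column("6__len__160 75 85 0 0.46875\n");
print "$head | $last\n";   # prints "6__len__160 75 85 0 | 0.46875"
```

The two pieces could then serve as key and value of a hash, as the question describes, although sorting the lines directly avoids that extra step.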
In case this helps folks dropping by in the future - here are some inelegant attempts to sort() the content of lines.txt (data from question), by its fifth column, with a Perl one-liner. This should work:
perl -E 'say "@$_" for sort {$a->[4] <=> $b->[4]} map {[(split)]} <>' file
This is more or less the same thing but with the split "automated" with the autosplit (-a) switch which creates the @F array:
perl -anE 'push @t,[@F]}{say "@$_" for sort {$a->[4] <=> $b->[4]} @t' file
If the split pattern is not white space, you can substitute it for the default (\s+) shown here:
perl -E 'say sort {(split(/\s+/,$a))[4] <=> (split(/\s+/,$b))[4]} <>' file
This is the shortest way to sort and print the fifth column:
perl -E 'say for sort map{ (split)[4] } <>' file
Transforming the sort
Can we map, split and sort in one pass? This is a short way to sort the fifth column:
perl -E 'say for sort map{ [(split)[4], $_]->[0] } <>' file
Dissecting this last example: perl first maps the STDIN to split() - making a list; takes the fifth element (i.e. [4]) of this split() list and wraps that list item and the whole line that was just read ($_) inside an array constructor []; then takes the first element of that anonymous array (i.e. the fifth column of each line) and passes it to sort(). Phew!
This just prints the fifth column since we only passed the first element ->[0] of the anonymous array to sort. To print the whole line sorted by the column in this way we need to pass the whole anonymous array to sort and tell sort to use the element which holds the column's contents to do its work, and then pass the other element of the anonymous array (the one that holds the entire line) to print (or say) - this way we can sort by the fifth column, but print out the whole line:
perl -E 'say $_->[1] for sort{$a->[0] <=> $b->[0]} map{[(split)[4], $_]} <>' file
This is just like our very first example above. If, instead of running through the list that is created using for, we map the second element and pass it to print we get:
perl -E 'say map $_->[1], sort{$a->[0] <=> $b->[0]} map{[(split)[4],$_]} <>' file
We have reinvented the Schwartzian transform which is such a great Perl idiom that it is "built in" to Perl 6 ;-)
To get a sense of how this works you can "visualize" things with Data::Printer:
perl -MDDP -e '@t = sort map{ [ (split /\s+/)[4], $_ ] } <> ; p @t' file
Learn more about Perl idioms from the Perl Idioms Explained posts at perlmonks.org and the Perl Beyond Syntax chapter of Modern Perl.
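Outside a one-liner, the decorate, sort, undecorate steps described above could be written as a small function. This is only a sketch: the name sort_by_fifth_desc is invented here, and it sorts in descending order because that is what the question asked for.

```perl
use strict;
use warnings;

# Sort whole lines by their fifth whitespace-separated column, largest first.
# Read the chain from the bottom up.
sub sort_by_fifth_desc {
    my @lines = @_;
    return map  { $_->[1] }                     # 3. undecorate: keep only the line
           sort { $b->[0] <=> $a->[0] }         # 2. compare the precomputed keys
           map  { [ (split ' ', $_)[4], $_ ] }  # 1. decorate: [key, line]
           @lines;
}

print sort_by_fifth_desc(
    "2__len__497 0 497 0 0\n",
    "6__len__160 75 85 0 0.46875\n",
    "3__len__55 7 48 0 0.127272727272727\n",
);
# prints the 6__len__160 line, then 3__len__55, then 2__len__497
```

Each line is split exactly once, which is the point of the idiom when the table has millions of rows; in a script the lines would typically come from a file handle opened on the input file.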
You might also like the nsort_by function from List::UtilsBy, or rev_nsort_by for descending order:
use List::UtilsBy 'rev_nsort_by';
my @sorted = rev_nsort_by { (split(' ', $_))[4] } @input;

Script to find words inside a given word from wordlist

I have a dictionary with 250K words (txt file). For each of those words I would like to come up with a script that will throw all possible anagrams (each anagram should also be in the dictionary).
Ideally the script would output in this format:
word1: anagram1,anagram2...
word2: anagram1,anagram2...
Any help would be greatly appreciated.
Inspired by this, I would suggest you create a Trie.
Then, the trie with N levels will have all possible anagrams (where N is the length of the original word). Now, to get different sized words, I suggest you simply traverse the trie, ie. for all 3 letter subwords, just make all strings that are 3 levels deep in the trie.
I'm not really sure of this, because I didn't test this, but it's an interesting challenge, and this suggestion would be how I would start tackling it.
Hope it helps a little =)
It must be anagram week.
I'm going to refer you to an answer I submitted to a prior question: https://stackoverflow.com/a/12811405/128421. It shows how to build a hash for quick searches of words that have common letters.
For your purpose, of finding substrings/inner-words, you will also want to find the possible inner words. Here's how to quickly locate unique combinations of letters of varying sizes, based on a starting word:
word = 'misses'
word_letters = word.downcase.split('').sort
3.upto(word.length) { |i| puts word_letters.combination(i).map(&:join).uniq }
eim
eis
ems
ess
ims
iss
mss
sss
eims
eiss
emss
esss
imss
isss
msss
eimss
eisss
emsss
imsss
eimsss
Once you have those combinations, split them (or don't do the join) and do look-ups in the hash my previous answer built.
What I tried so far in Perl:
use strict;
use warnings;
use Algorithm::Combinatorics qw(permutations);
die "First argument should be a dict\n" unless $ARGV[0];
open my $fh, "<", $ARGV[0] or die $!;
my @arr = <$fh>;
my $h = {};
map { chomp; $h->{lc($_)} = [] } @arr;
foreach my $word (@arr) {
    $word = lc($word);
    my $chars = [ ( $word =~ m/./g ) ];
    my $it = permutations($chars);
    while ( my $p = $it->next ) {
        my $str = join "", @$p;
        if ($str ne $word && exists $h->{$str}) {
            push @{ $h->{$word} }, $str
                unless grep { /^$str$/ } @{ $h->{$word} };
        }
    }
    if (@{ $h->{$word} }) {
        print "$word\n";
        print "\t$_\n" for @{ $h->{$word} };
    }
}
END{ close $fh; }
There's maybe some possible improvement for speed, but it works.
I use French dict from words archlinux package.
EXAMPLE
$ perl annagrammes.pl /usr/share/dict/french
abaissent
    absentais
    abstenais
abaisser
    baissera
    baserais
    rabaisse
(...)
NOTE
To install the Perl module:
cpan -i Algorithm::Combinatorics
h = Hash.new { |hash, key| hash[key] = [] } # the block must store the new array, otherwise every push is lost
array_of_words.each{|w| h[w.downcase.chars.sort].push(w)}
h.values
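The same grouping idea can be sketched in Perl: key a hash on each word's sorted letters, so that all anagrams of a word land in the same bucket after one pass over the dictionary, with no permutations generated. The function name group_anagrams and the sample words are assumptions made for this example.

```perl
use strict;
use warnings;

# Group words by their sorted letters; words sharing a key are anagrams.
sub group_anagrams {
    my @words = @_;
    my %group;
    for my $word (@words) {
        my $key = join '', sort split //, lc $word;   # "listen" -> "eilnst"
        push @{ $group{$key} }, lc $word;
    }
    # Keep only the buckets that actually contain more than one word.
    return grep { @$_ > 1 } values %group;
}

for my $set (group_anagrams(qw(listen silent enlist google banana))) {
    my ($first, @rest) = sort @$set;
    print "$first: ", join(',', @rest), "\n";         # prints "enlist: listen,silent"
}
```

Sorting the letters of each word costs far less than enumerating its permutations, which grows factorially with word length.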

aggregate totals when key changes in Perl

I have an input file with the following format
ant,1
bat,1
bat,2
cat,4
cat,1
cat,2
dog,4
I need to aggregate the col2 for each key (column1) so the result is:
ant,1
bat,3
cat,7
dog,4
Other considerations:
Assume that the input file is sorted
The input file is pretty large (about 1M rows), so I don't want to use an array and take up memory
Each input line should be processed as we read it, and move to the next line
I need to write the results to an outFile
I need to do this in Perl, but a pseudo-code or algorithm would help just as fine
Thanks!
This is what I came up with. I'd like to see whether it can be written better or more elegantly.
open infile, outFile
$prev_line = <infile>;
$print_line = $prev_line;
while(<>){
    $curr_line = $_;
    @prev_cols=split(',', $prev_line);
    @curr_cols=split(',', $curr_line);
    if ( $prev_cols[0] eq $curr_cols[0] ){
        $prev_cols[1] += $curr_cols[1];
        $print_line = "$prev_cols[0],$prev_cols[1]\n";
        $print_flag = 0;
    }
    else{
        print outFile "$print_line";
        $print_flag = 1;
        $print_line = $curr_line;
    }
    $prev_line = $curr_line;
}
if($print_flag == 1){
    print outFile "$curr_line";
}
else{
    print outFile "$print_line";
}
#!/usr/bin/perl
use warnings;
use strict;
use integer;
my %a;
while (<>) {
    my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
    $a{$animal} += $n if defined $n;
}
print "$_,${a{$_}}\n" for sort keys %a;
This short code affords you the chance to learn Perl's excellent hash facility, as %a. Hashes are central to Perl. One really cannot write fluent Perl without them.
Observe incidentally that the code exercises Perl's interesting autovivification feature. The first time a particular animal is encountered in the input stream, no count exists, so Perl implicitly assumes a pre-existing count of zero. Thus, the += operator does not fail, even though it seems that it should. It just adds to zero in the first instance.
On the other hand, it may happen that not only the number of data but the number of animals is so large that one would not like to store the hash %a. In this case, one can still calculate totals, provided only that the data are sorted by animal in the input, as they are in your example. In this case, something like the following might suit (though regrettably it is not nearly so neat as the above).
#!/usr/bin/perl
use warnings;
use strict;
use integer;
my $last_animal = undef;
my $total_for_the_last_animal = 0;
sub start_new_animal ($$) {
    my $next_animal = shift;
    my $n = shift;
    print "$last_animal,$total_for_the_last_animal\n"
        if defined $last_animal;
    $last_animal = $next_animal;
    $total_for_the_last_animal = $n;
}
while (<>) {
    my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
    if (
        defined($n) && defined($animal) && defined($last_animal)
        && $animal eq $last_animal
    ) { $total_for_the_last_animal += $n; }
    else { start_new_animal $animal, $n; }
}
start_new_animal undef, 0;
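Since the question asks for the totals to be written to an output file, the streaming approach might also be packaged as a function that takes an input and an output file handle. The following is a sketch under the same assumption that the input is sorted by key; the name aggregate_sorted is invented here, and the in-memory handles only stand in for handles opened on real files.

```perl
use strict;
use warnings;

# Sum column 2 per key, reading from $in and writing to $out.
# Only one running total is kept, so memory use does not grow with the input.
sub aggregate_sorted {
    my ($in, $out) = @_;
    my ($key, $total);
    while (my $line = <$in>) {
        chomp $line;
        my ($k, $n) = split /,/, $line;
        next unless defined $n;                 # skip blank or malformed lines
        if (defined $key && $k ne $key) {
            print {$out} "$key,$total\n";       # key changed: flush the finished group
            $total = 0;
        }
        $key = $k;
        $total += $n;
    }
    print {$out} "$key,$total\n" if defined $key;   # flush the last group
}

# Usage with in-memory handles standing in for the real input and output files.
my $sample = "ant,1\nbat,1\nbat,2\ncat,4\n";
my $result = '';
open my $in,  '<', \$sample or die $!;
open my $out, '>', \$result or die $!;
aggregate_sorted($in, $out);
close $out;
print $result;   # prints ant,1 then bat,3 then cat,4, one per line
```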
Use Perl’s awk mode.
-a
turns on autosplit mode when used with a -n or -p. An implicit split command to the @F array is done as the first thing inside the implicit while loop produced by the -n or -p.
perl -ane 'print pop(@F), "\n";'
is equivalent to
while (<>) {
    @F = split(' ');
    print pop(@F), "\n";
}
An alternate delimiter may be specified using -F.
All that’s left for you is to accumulate the sums in a hash and print them.
$ perl -F, -lane '$s{$F[0]} += $F[1];
END { print "$_,$s{$_}" for sort keys %s }' input
Output:
ant,1
bat,3
cat,7
dog,4
It's trivial in perl. Loop on the file input. Split the input line on comma. For each key in column one keep a hash to which you add the value in column two. At the end of the file print the list of hash keys and their values. It can be done in one line but that would obfuscate the algorithm.
