aggregate totals when key changes in Perl - algorithm

I have an input file with the following format
ant,1
bat,1
bat,2
cat,4
cat,1
cat,2
dog,4
I need to aggregate the col2 for each key (column1) so the result is:
ant,1
bat,3
cat,7
dog,4
Other considerations:
Assume that the input file is sorted
The input file is pretty large (about 1M rows), so I don't want to use an array and take up memory
Each input line should be processed as we read it, and move to the next line
I need to write the results to an outFile
I need to do this in Perl, but a pseudo-code or algorithm would help just as fine
Thanks!
This is what I came up with... want to see if this can be written better/elegant.
open infile, outFile
$prev_line = <infile>;
$print_line = $prev_line;
while (<infile>) {
    $curr_line = $_;
    @prev_cols = split(',', $prev_line);
    @curr_cols = split(',', $curr_line);
    if ( $prev_cols[0] eq $curr_cols[0] ) {
        $prev_cols[1] += $curr_cols[1];
        $print_line = "$prev_cols[0],$prev_cols[1]\n";
        $print_flag = 0;
    }
    else {
        print outFile "$print_line";
        $print_flag = 1;
        $print_line = $curr_line;
    }
    $prev_line = $curr_line;
}
if ($print_flag == 1) {
    print outFile "$curr_line";
}
else {
    print outFile "$print_line";
}

#!/usr/bin/perl
use warnings;
use strict;
use integer;
my %a;
while (<>) {
    my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
    $a{$animal} += $n if defined $n;
}
print "$_,$a{$_}\n" for sort keys %a;
This short code affords you the chance to learn Perl's excellent hash facility, as %a. Hashes are central to Perl. One really cannot write fluent Perl without them.
Observe incidentally that the code exercises Perl's interesting autovivification feature. The first time a particular animal is encountered in the input stream, no count exists, so Perl implicitly assumes a pre-existing count of zero. Thus, the += operator does not fail, even though it seems that it should. It just adds to zero in the first instance.
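For instance, this standalone snippet (my illustration, not part of the answer) shows that first += on a missing key in isolation:

use strict;
use warnings;

my %count;
$count{bat} += 2;   # no 'bat' entry exists yet; += treats the missing value as 0
$count{bat} += 1;
print "$count{bat}\n";   # prints 3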
On the other hand, it may happen that not only the volume of data but also the number of distinct animals is so large that one would not like to keep the hash %a in memory. In that case, one can still calculate totals, provided only that the data are sorted by animal in the input, as they are in your example. Something like the following might then suit (though regrettably it is not nearly so neat as the above).
#!/usr/bin/perl
use warnings;
use strict;
use integer;
my $last_animal = undef;
my $total_for_the_last_animal = 0;

sub start_new_animal ($$) {
    my $next_animal = shift;
    my $n = shift;
    print "$last_animal,$total_for_the_last_animal\n"
        if defined $last_animal;
    $last_animal = $next_animal;
    $total_for_the_last_animal = $n;
}

while (<>) {
    my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
    if (
        defined($n) && defined($animal) && defined($last_animal)
        && $animal eq $last_animal
    ) { $total_for_the_last_animal += $n; }
    else { start_new_animal $animal, $n; }
}
start_new_animal undef, 0;

Use Perl’s awk mode.
-a
turns on autosplit mode when used with a -n or -p. An implicit split command to the @F array is done as the first thing inside the implicit while loop produced by the -n or -p.
perl -ane 'print pop(@F), "\n";'
is equivalent to
while (<>) {
    @F = split(' ');
    print pop(@F), "\n";
}
An alternate delimiter may be specified using -F.
All that’s left for you is to accumulate the sums in a hash and print them.
$ perl -F, -lane '$s{$F[0]} += $F[1];
END { print "$_,$s{$_}" for sort keys %s }' input
Output:
ant,1
bat,3
cat,7
dog,4

It's trivial in Perl. Loop over the file input. Split each input line on the comma. Keep a hash keyed by column one, adding to each key the value in column two. At the end of the file, print the list of hash keys and their values. It can be done in one line, but that would obfuscate the algorithm.


Perl: how to combine consecutive page numbers?

OS: Windows server 2012, so I don't have access to Unix utils
Activestate Perl 5.16. Sorry I cannot upgrade the OS or Perl, I'm stuck with it.
I did a google search and read about 10 pages from that, I find similar problems but not what I'm looking for.
I then did 3 searches here and found similar issues with SQL, R, XSLT, but not what I'm looking for.
I actually am not sure where to start so I don't even have code yet.
I'd like to combine consecutive page numbers into a page range. Input will be a series of numbers in an array.
Input as an array of numbers: my @a = (1,2,5)
Output as a string: 1-2, 5
Input ex: (1,2,3,5,7)
Output ex: 1-3, 5, 7
Input ex: (100,101,102,103,115,120,121)
Output ex: 100-103,115,120-121
Thank you for your help!
This is the only code I have so far.
sub procpages_old
# $aref  = array ref to list of page numbers.
# $model = used for debugging.
# $zpos  = used for debugging only.
{
    my ($aref, $model, $zpos) = @_;
    my $procname = (caller(0))[3];
    my @arr = @$aref;   # Array of page numbers.
    my @newarr = ();
    my $i = 0;
    my $np1 = 0;   # Page 1 of possible range.
    my $np2 = 0;   # Page 2 of possible range.
    my $p1 = 0;    # Page number to test.
    my $p2 = 0;
    my $newpos = 0;
    while ($i < $#arr)
    {
        $np1 = $arr[$i];
        $np2 = getdata($arr[$i+1], '');
        $p1 = $np1;
        $p2 = $np2;
        while ($p2 == ($p1 + 1))   # Consecutive page numbers?
        {
            $i++;
            $p1 = $arr[$i];
            $p2 = getdata($arr[$i+1], '');
        }
        $newarr[$newpos] = $np1 . '-' . $p2;
        $newpos++;
        # End of loop
        $i++;
    }
    my $pages = join(', ', @newarr);
    return $pages;
}
That's called an intspan. Use Set::IntSpan::Fast::XS.
use Set::IntSpan::Fast::XS qw();
my $s = Set::IntSpan::Fast::XS->new;
$s->add(100,101,102,103,115,120,121);
$s->as_string; # 100-103,115,120-121
This seems to do what you want.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
    chomp;
    say rangify(split /,/);
}

sub rangify {
    my @nums = @_;
    my @range;

    for (0 .. $#nums) {
        if ($_ == 0 or $nums[$_] != $nums[$_ - 1] + 1) {
            push @range, [ $nums[$_] ];
        } else {
            push @{$range[-1]}, $nums[$_];
        }
    }

    for (@range) {
        if (@$_ == 1) {
            $_ = $_->[0];
        } else {
            $_ = "$_->[0]-$_->[-1]";
        }
    }

    return join ',', @range;
}
__DATA__
1,2,5
1,2,3,5,7
100,101,102,103,115,120,121
The rangify() function builds an array of arrays. It traverses your input list and if a number is just one more than the previous number then it adds the new number to the second-level array that's currently at the end of the first-level array. If the new number is not sequential, it adds a new second-level array at the end of the first level array.
Having built this data structure, we walk the first-level array, looking at each of the second-level arrays. If the second level array contains only one element then we know it's not a range, so we overwrite the value with the single number from the array. If it contains more than one element, then it's a range and we overwrite the value with the first and last elements separated with a hyphen.
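For the three DATA lines above, rangify() should therefore print (my expected output, not shown in the original answer):

1-2,5
1-3,5,7
100-103,115,120-121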
So I managed to adjust this code to work for me. Pass your array of numbers into procpages() which will then call num2range().
######################################################################
# In:
# Out:
sub num2range
{
    local $_ = join ',' => @_;
    s/(?<!\d)(\d+)(?:,((??{$++1}))(?!\d))+/$1-$+/g;
    tr/-,/, /;
    return $_;
}
######################################################################
# Concatenate consecutive page numbers in array.
# In:  array like (1,2,5,7,99,100,101)
# Out: string like "1-2, 5, 7, 99-101"
sub procpages
{
    my ($aref, $model, $zpos) = @_;
    my $procname = (caller(0))[3];
    my @arr = @$aref;
    my $pages = num2range(@arr);
    $pages =~ s/\,/\-/g;   # Change comma to dash.
    $pages =~ s/ /\, /g;   # Change space to comma and space.
    #$pages =~ s/\,/\, /g;
    return $pages;
}
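A quick usage sketch (my addition, untested): procpages() takes an array reference, with the remaining two parameters used only for debugging, so a call would look like:

my @pages = (1, 2, 5, 7, 99, 100, 101);
print procpages(\@pages), "\n";   # expected: "1-2, 5, 7, 99-101"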
You probably have the best solution already with the Set::IntSpan::Fast::XS module, but assuming you want to take the opportunity to learn Perl, here's another Perl-ish way to do it.
use strict;
use warnings;
my @nums = (1,2,5);
my $prev = -999;   # assuming you only use positive values, this will work
my @out = ();

for my $num (@nums) {
    # if we are continuing a sequence, add a hyphen unless we did last time
    if ($num == $prev + 1) {
        push(@out, '-') unless (@out and $out[-1] eq '-');
    }
    else {
        # if we are breaking a sequence (@out ends in '-'), add the previous number first
        if (@out and $out[-1] eq '-') {
            push(@out, $prev);
        }
        # then add the current number
        push(@out, $num);
    }
    # track the previous number
    $prev = $num;
}
# add the final number if necessary to close the sequence
push(@out, $prev) if (@out and $out[-1] eq '-');
# join all values with comma
my $pages = join(',', @out);
# flatten the ',-,' sequence to a single '-'
$pages =~ s/,-,/-/g;
print "$pages\n";
This is not super elegant or short, but is very simple to understand and debug.

How can I read a CSV file if only non-empty fields are wrapped by double quotes?

I'm trying to read a CSV file in a Bash script. I achieved that successfully using gawk and specifying FPAT like:
gawk -v LOGFILE="${LOGFILE}" 'BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
}
NR == 1 {
    # doing some logic with header
}
NR >= 2 {
    # doing some logic with fields
}' <filename>
The problem here is, the file contains data like:
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
Now, with this data I'm getting wrong results, because the pattern skips the empty fields between consecutive commas, which shifts the positions of the extracted fields.
For example, it reports "7865431234" at the 3rd position whereas it is at the 6th.
Can anyone suggest the changes to get the correct position of fields?
Your FPAT requires each field to contain at least one character, but you want to recognize empty fields with zero characters. Add an alternative to FPAT that allows zero characters:
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("[%s]", $i); print "" }'
Note the extra | at the end of FPAT. The action simply identifies the record number, the number of fields, and surrounds the value of each field with square brackets.
When your data string is provided to that script, the output is:
1:8:["RAM"]["31st street, Bengaluru, India"][][][]["7865431234"][]["VALID"]
That shows the four empty fields quite clearly.
Now all you have to do is deal with:
"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,"",,,"INVALID"
where there are double quotes inside the quoted value. That's not dreadfully hard to manage:
gawk 'BEGIN { FPAT = "([^,]+)|(\"([^\"]|\"\")*\")[^,]*|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("%d[%s]", i, $i); print "" }' "$@"
The FPAT says that a field is:
a sequence of non-commas,
or it is a field started with a double quote, containing zero or more instances of either:
a non-quote, or
two double quotes
followed by a double quote and optional non-comma data
or it is empty
Note that the 'optional non-comma data' should be empty, and only appears in malformed CSV data.
Given input data:
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,,,,"INVALID"
"Some","","Empty","",Fields "" Wrapped,"",in quotes
"Malformed" CSV,Data,"Note it has data after" a close quote,"and before a comma,",,"INVALID"
This produces:
1:8:1["RAM"]2["31st street, Bengaluru, India"]3[]4[]5[]6["7865431234"]7[]8["VALID"]
2:8:1["Mr ""Manipulator"", the Artisan"]2["29th Street, Delhi, India"]3[]4[]5[]6[]7[]8["INVALID"]
3:7:1["Some"]2[""]3["Empty"]4[""]5[Fields "" Wrapped]6[""]7[in quotes]
4:6:1["Malformed" CSV]2[Data]3["Note it has data after" a close quote]4["and before a comma,"]5[]6["INVALID"]
Note that the field numbers are included as a prefix to the bracketed data (so I tweaked the print format slightly).
About the only format this doesn't handle is one where newlines can be embedded in the data for a field — by the nature of the line-based input, it assumes that no field is split over multiple lines. (It also means it won't properly recognize a field that starts with a double quote and doesn't have a matching double quote before the end of the line. I suppose you could add an alternative to recognize that. It would be better just to make the data right.)
Note the advice in Sobrique's answer to use a tool designed to handle CSV for handling CSV. That is generally a good idea, and the more complex the sets of variations you have to deal with, the better an idea it is. This is close to as complicated a regex as you should consider using. Also note that although RFC 4180 defines a version of CSV formally and rigorously, there are multiple programs (including MS Office) that handle different but related formats.
If you have CSV that needs parsing, then whilst you can usually hack it with a regex, it's far easier to use a parser.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new;

open( my $input, '<', 'flarg.csv' ) or die $!;

while ( my $row = $csv->getline($input) ) {
    if ( $. == 1 ) {
        # do first row stuff;
        print "Header: ", join( ',', @$row ), "\n";
    }
    else {
        print join "\n", @$row;
    }
}
Or simpler yet - use Text::ParseWords which is core.
#!/usr/bin/env perl
use strict;
use warnings;
use Text::ParseWords;
while ( my $line = <DATA> ) {
    my @fields = parse_line( ',', 1, $line );
    print join "\n", @fields;
}
__DATA__
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"

perl text mining code can't handle massive amounts of data

I'm doing a large text mining project. I have 100,000 text files. I've extracted two- and three-word phrases from sets of 1,000 documents at a time and have created 100 files. Each file has roughly 8 million lines in this format:
total_references num_docs_referencing_phrase phrase
I want to create an aggregate list of total references and number of docs referencing each phrase by processing the 100 intermediate files. To that end I wrote this program.
#!/usr/bin/perl -w
$| = 1 ;   # Don't buffer output
use File::Find ;

$dir = "/home/sl/phrase-counts" ;

find(\&processFile, $dir) ;

for $key ( keys %TOTALREFS ) {
    print "$TOTALREFS{$key} $NUMDOCS{$key} ${key}\n" ;
}

sub processFile {
    my $file = $_ ;
    my $fullName = $File::Find::name ;
    if ( $fullName =~ /\.txt$/ ) {
        $date = `date` ;
        chomp $date ;
        print "($date) file: $fullName\n" ;
        open INFILE, "$fullName" or die "Cannot read ${fullName}";
        while ( <INFILE> ) {
            my $line = $_ ;
            chomp $line ;
            ( $totalRefs, $numDocs, $phrase ) = split (/\s+/, $line, 3) ;
            $TOTALREFS{$phrase} += $totalRefs ;
            $NUMDOCS{$phrase} += $numDocs ;
        }
        close ( INFILE ) ;
    }
}
The code produces strange errors after 8 or so files are processed and then it hangs, i.e. it stops listing files it should be processing.
Use of uninitialized value $date in scalar chomp at ./getCounts line 21.
Use of uninitialized value $date in concatenation (.) or string at ./getCounts line 22.
I don't believe the problem is really my date command, especially since it runs fine for a number of early files and since the failure does not occur at the same point in every run. I assume the problem is that my program is consuming too much system resource and corrupting the state of the running environment. Running top, I watch memory use climb to 97% of the machine, which concerns me, although I notice that the errors and the hang occur before top shows memory running out. And there is some swap on the machine.
My question is, how can I rewrite this program to actually complete its execution? With 8 million lines of data for each of 100 files there could be 800 million lines of output although I would guess that the total is more likely in the range of 50-100 million lines. I have done some cleanup of the data and could consider more aggressive sanitizing of phrases to cut down on the numbers but I'd like to understand how I can design this code better.
I've seen articles that tell programmers to put their data into a database. My concern is the time it might take to update a database 100 million times.
Suggestions?
It looks like you're running on a *nix system, so make sort do all the work for you. It knows how to use memory efficiently.
sort -k 3 all_your_input_files*.txt > sorted.txt
Why do this? Because now all lines corresponding to the same phrase appear in a single block within the file, so you can compute totals easily: just write a short Perl script that adds the current line's numbers to the current totals, and writes them out whenever the phrase changes from the previous line (and at the end):
my ($oldPhrase, $totTotalRefs, $totNumDocs) = (undef, 0, 0);

while ( <INFILE> ) {
    my $line = $_ ;
    chomp $line ;
    ( $totalRefs, $numDocs, $phrase ) = split (/\s+/, $line, 3) ;
    if (defined($oldPhrase) && $phrase ne $oldPhrase) {
        print "$totTotalRefs $totNumDocs $oldPhrase\n" ;
        $totTotalRefs = $totNumDocs = 0;
    }
    $totTotalRefs += $totalRefs ;
    $totNumDocs += $numDocs ;
    $oldPhrase = $phrase;
}
close ( INFILE ) ;
print "$totTotalRefs $totNumDocs $oldPhrase\n" ;
The above code is untested, but should work with appropriate boilerplate added I think.
[EDIT: Fix bug in which $oldPhrase never gets set, as suggested by Sol.]
You are storing all of the different phrases as keys for both %TOTALREFS and %NUMDOCS, so things are at least twice as bad as they need to be.
I suggest you try the following
Add use strict and use warnings (instead of -w) and declare all of your variables properly
Don't use capitals in your variable names. Capital letters are reserved for global identifiers
Don't start 100 subprocesses just to get the time of day. Just use localtime like this
printf "(%s) file: %s\n", scalar localtime, $full_name;
Use find just to generate an array of the files to be processed, so it would look like this
my @files;
find(sub {
    push @files, $File::Find::name if -f and /\.txt$/i;
}, $dir);
Then you can process each file with a simple for loop
for my $file (@files) {
    ...
}
Take two passes through the files, the first time generating a hash that relates each phrase to an integer starting at zero, and the second that uses those integers to index arrays @total_refs and @num_docs and increment their elements
You may still run out of memory, but those measures will certainly give you a better chance.
Update
Just to be clear, this is how I imagine it would work. I've done this as a single pass, but it may be better to write it as two passes as I described so that you can check your intermediate data.
Note that this isn't tested apart from making sure that it compiles.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use autodie;

STDOUT->autoflush;

use File::Find;

my $dir = '/home/sl/phrase-counts';

my @files;
find(sub {
    push @files, $File::Find::name if -f and /\.txt$/i;
}, $dir);

my (%phrases, @total_refs, @num_docs);
my $num_phrases = 0;

for my $file (@files) {
    printf "(%s) file: %s\n", scalar localtime, $file;
    open my $in_fh, '<', $file;
    while (<$in_fh>) {
        chomp;
        my ($total_refs, $num_docs, $phrase) = split ' ', $_, 3;
        my $phrase_num = $phrases{$phrase} //= $num_phrases++;
        $total_refs[$phrase_num] += $total_refs;
        $num_docs[$phrase_num]   += $num_docs;
    }
}

for my $phrase (keys %phrases) {
    my $phrase_num = $phrases{$phrase};
    printf "%s %s %s\n",
        $total_refs[$phrase_num],
        $num_docs[$phrase_num],
        $phrase;
}
Trying to use more resources than are available causes exceptions for being unable to allocate memory, or results in system calls returning error messages. It doesn't corrupt memory.
In this case, the result of backticks is undef, which means the command could not be executed. That could very well be because you have insufficient memory left. Where did you get the idea that being unable to execute a program is the result of corrupted memory?! Furthermore, you have an error you don't understand, yet you didn't check what error was returned? Backticks set $? (and $! when $? is negative) as per system. Assuming it's a bug in Perl is a very bad assumption to make, especially when the system tells you what error occurred.
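As an illustration of that advice (my sketch, not from the original answer), the failure is easy to surface by checking what the backticks actually returned:

my $date = `date`;
if ( !defined $date ) {
    # backticks set $? (and $! when $? is negative) as per system,
    # so the real reason the command could not run is available here
    warn "could not run date: \$? = $?, \$! = $!\n";
}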
Use less memory, either through the use of a more appropriate and/or efficient data structure, or by keeping a portion of the data out of memory (e.g. on disk or in a database).
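For the on-disk option, one minimal sketch (my assumption; any DBM module with a tie interface would do, and I have not tested this at the question's scale) is to tie the hash to a DBM file so that each update goes to disk instead of RAM:

use strict;
use warnings;
use DB_File;

# counts now live in phrase-counts.db on disk rather than in memory
tie my %total_refs, 'DB_File', 'phrase-counts.db'
    or die "Cannot tie phrase-counts.db: $!";

$total_refs{'some phrase'} += 42;   # read-modify-writes the file transparently
untie %total_refs;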

awk: Interpreting strings as mathematical expressions

Context: I have an input file that contains parameters with associated values followed by literal mathematical expressions such as:
PARAMETERS DEFINITION
A = 5; B = 2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
...
and I would like to get the strings of the second part to be interpreted as mathematical expressions so that I get the results of the expressions in my output file:
...
MATHEMATICAL EXPRESSIONS
10
0.2
...
What I did already: So far, using awk, I store all the parameter names and their corresponding values in two distinct arrays. I then replace each parameter with its value so that I am now in a similar situation to the author of this thread.
However, the answers s/he gets are not in awk except for the last one which is very specific to her/his situation, and hard to understand for me as a beginner with awk and shell scripting.
What I tried afterwards: As I have no clue how to do this in awk, the idea I had was to store the new field value in a variable, then use a shell command within the awk script like this:
#!/bin/awk -f
BEGIN{}
{
    myExpression = $1
    system("echo $myExpression | bc")
}
END{}
This, unfortunately, does not work, as the variable is somehow not recognized by the echo command.
What I would like:
I would prefer a solution using awk alone with no call to external functions, however, I am not against one using a shell command if it is simpler.
EDIT Taking into account all the comments so far, I will be more precise: my input files look more like this:
PARAMETERS_DEFINITION
[param1] = 5
[param2] = 2
[param3] = 1.5
[param4] = 7.5
MATHEMATICAL_EXPRESSIONS
[param1]*[param2]
some text containing also numbers and formulas that I do not want to be affected.
e.g: 1.45*2.6 = x, de(x)/dx=e(x) ; blah,blah,blah
[param3]/[param4]
The names of the parameters are complex enough that any match of the string "[param#]" within the document corresponds to a parameter that I want replaced by its value.
The way I manage to store the parameters and their values in arrays is the following:
{
    if (match($2, /PARAMETERS_DEFINITION/) != 0) { paramSwitch = 1 }
    if (match($2, /MATHEMATICAL_EXPRESSIONS/) != 0) { paramSwitch = 0 }
    if (paramSwitch == 1)
    {
        parameterName[numOfParam] = $1 ;
        parameterVal[numOfParam] = $3 ;
        numOfParam += 1
    }
}
Instead of this:
{
    myExpression = $1
    system("echo $myExpression | bc")
}
I think you'd want this:
{
    myExpression = $1
    system("echo " myExpression " | bc")
}
That's because in awk, assignments do not end up as environment variables, and putting strings next to each other concatenates them.
You are asking awk to interpret strings as mathematical expressions. That functionality is usually called eval, and (AFAIK) awk has no such function, so your question is a typical XY problem.
The right tool for this is bc, where you (nearly) don't need to modify anything: simply feed bc your input, only ensuring that the variables are lowercase, as in the following input (your example, edited):
#PARAMETERS DEFINITION
a=5; b=2; c=1.5; d=7.5
#MATHEMATICAL EXPRESSIONS
a*b
c/d
using like
bc -l < inputfile
produces
10
.20000000000000000000
EDIT
For your edit, with the new input data, the following
grep '\[' inputfile | sed 's/[][]//g' | bc -l
for the input
PARAMETERS_DEFINITION
[param1] = 5
[param2] = 2
[param3] = 1.5
[param4] = 7.5
MATHEMATICAL_EXPRESSIONS
[param1]*[param2]
some text containing also numbers and formulas that I do not want to be affected.
e.g: 1.45*2.6 = x, de(x)/dx=e(x) ; blah,blah,blah
[param3]/[param4]
produces the following output:
10
.20000000000000000000
i.e. grep keeps only the lines that contain [ (any parameter definition or expression), sed removes the [ and ] characters, creating the following bc program:
param1 = 5
param2 = 2
param3 = 1.5
param4 = 7.5
param1*param2
param3/param4
and send the whole "program" to bc...
Using BIDMAS as a basis, I have created this mathematical function in awk.
I have not included brackets (or indices) yet, as they will require some extra effort, but I may add them later.
This awk script effectively works as bc does.
No system call required; it's all in awk.
Generic version for all applications
awk '{
    split($0, a, "+")
    for (i in a) {
        split(a[i], s, "-")
        for (j in s) {
            split(s[j], m, "*")
            for (k in m) {
                split(m[k], d, "/")
                for (l in d) {
                    if (l > 1) d[1] = d[1] / d[l]
                }
                m[k] = d[1]
                delete d
                if (k > 1) m[1] = m[1] * m[k]
            }
            s[j] = m[1]
            delete m
            if (j > 1) s[1] = s[1] - s[j]
        }
        a[i] = s[1]
        delete s
    }
    for (i in a) b = b + a[i]
    print b
}
{ b = 0 }' file
For your specific example
awk '
/MATHEMATICAL_EXPRESSIONS/ { z = 1 }
NR > 1 && !z { split($0, y, " = "); x[y[1]] = y[2] }
z && /[\+\-\/\*]/ {
    for (n in x) gsub(n, x[n])
    split($0, a, "+")
    for (i in a) {
        split(a[i], s, "-")
        for (j in s) {
            split(s[j], m, "*")
            for (k in m) {
                split(m[k], d, "/")
                for (l in d) {
                    if (l > 1) d[1] = d[1] / d[l]
                }
                m[k] = d[1]
                delete d
                if (k > 1) m[1] = m[1] * m[k]
            }
            s[j] = m[1]
            delete m
            if (j > 1) s[1] = s[1] - s[j]
        }
        a[i] = s[1]
        delete s
    }
    for (i in a) b = b + a[i]
    print b
}
{ b = 0 }' file
There is something like an eval for awk in the sense of its magical string-to-number conversion when the context requires it; here, adding +0 forces that conversion.
What I got for you (detailed version below), with a file named awkinput holding your example input:
awk '/[A-Z]=[0-9.]+;?/ { for (i=1;i<=NF;i++) { print "working on "$i; split($i,fields,"="); sub(/;/,"",fields[2]); params[fields[1]]=strtonum(fields[2]) } }; /[A-Z](\*|\/|\+|-)[A-Z]/ { for (p in params) { sub(p, params[p],$0); }; system("echo " $0 " | bc -ql") }' awkinput
Detailed:
/[A-Z]=[0-9.]+;?/ {   # if we match something like A=4.2, with or without a ; at the end
    for (i=1; i<=NF; i++) {                      # loop through the fields (separated by space, the default Field Separator of awk)
        print "working on "$i                    # inform on what we do
        split($i, fields, "=")                   # split into an array to get param and value
        sub(/;/, "", fields[2])                  # remove the ; at the end, if any
        params[fields[1]] = strtonum(fields[2])  # new array of parameters where the values are numeric
    }
}
/[A-Z](\*|\/|\+|-)[A-Z]/ {       # when the line matches a math operation with one param on each side (at least)
    for (p in params) {          # loop over the known params
        sub(p, params[p], $0)    # replace each param with its value
    }
    system("echo " $0 " | bc -ql")   # print the result (no way to get rid of the system call here)
}
Drawback:
A math of the form AB*C would be resolved to 52*1.5
$ cat test
PARAMETERS DEFINITION
A=5; B=2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
$ awk -vRS='[= ;\n]' '{if ($0 ~ /[0-9]/){a[x] = $0; print x"="a[x]}else{x=$0}}/MATHEMATICAL/{print "MATHEMATICAL EXPRESSIONS"}{if ($0~"*") print a[substr($0,1,1)] * a[substr($0,3,1)]}{if ($0~"/") print a[substr($0,1,1)] / a[substr($0,3,1)]}' test
A=5
B=2
C=1.5
D=7.5
MATHEMATICAL EXPRESSIONS
10
0.2
Formatted nicely:
$ cat test.awk
# Store all variables in an array
{
    if ($0 ~ /[0-9]/) {
        a[x] = $0;
        print x " = " a[x]   # Print the keys & values
    }
    else {
        x = $0
    }
}

# Print header
/MATHEMATICAL/ { print "MATHEMATICAL EXPRESSIONS" }

# Do the maths (case can work too, but it's not as widely available)
{
    if ($0 ~ "*")
        print a[substr($0,1,1)] * a[substr($0,3,1)]
}
{
    if ($0 ~ "/")
        print a[substr($0,1,1)] / a[substr($0,3,1)]
}
{
    if ($0 ~ "+")
        print a[substr($0,1,1)] + a[substr($0,3,1)]
}
{
    if ($0 ~ "-")
        print a[substr($0,1,1)] - a[substr($0,3,1)]
}
$ cat test
PARAMETERS DEFINITION
A=5; B=2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
D+C
C-A
$ awk -f test.awk -vRS='[= ;\n]' test
A = 5
B = 2
C = 1.5
D = 7.5
MATHEMATICAL EXPRESSIONS
10
0.2
9
-3.5

Script to find words inside a given word from wordlist

I have a dictionary with 250K words (txt file). For each of those words I would like to come up with a script that will output all possible anagrams (each anagram should also be in the dictionary).
Ideally the script would output in this format:
word1: anagram1,anagram2...
word2: anagram1,anagram2...
Any help would be greatly appreciated.
Inspired by this, I would suggest you create a Trie.
Then, the trie with N levels will have all possible anagrams (where N is the length of the original word). Now, to get different sized words, I suggest you simply traverse the trie, ie. for all 3 letter subwords, just make all strings that are 3 levels deep in the trie.
I'm not really sure of this, because I didn't test this, but it's an interesting challenge, and this suggestion would be how I would start tackling it.
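For what it's worth, a rough Perl sketch of that idea (my addition, as untested as the suggestion itself; the dictionary path is assumed): insert each word under the path of its sorted letters, so anagrams of one another collect at the same node.

use strict;
use warnings;

my %trie;
open my $dict, '<', '/usr/share/dict/words' or die $!;
while ( my $word = <$dict> ) {
    chomp $word;
    my $node = \%trie;
    # walk/create one trie level per sorted letter of the word
    $node = $node->{$_} //= {} for sort split //, lc $word;
    push @{ $node->{words} }, $word;   # all anagrams land in this list
}
close $dict;

Strings three levels deep in the trie are then the 3-letter candidates described above.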
Hope it helps a little =)
It must be anagram week.
I'm going to refer you to an answer I submitted to a prior question: https://stackoverflow.com/a/12811405/128421. It shows how to build a hash for quick searches of words that have common letters.
For your purpose of finding substrings/inner words, you will also want the possible letter combinations. Here's how to quickly locate unique combinations of letters of varying sizes, based on a starting word:
word = 'misses'
word_letters = word.downcase.split('').sort
3.upto(word.length) { |i| puts word_letters.combination(i).map(&:join).uniq }
eim
eis
ems
ess
ims
iss
mss
sss
eims
eiss
emss
esss
imss
isss
msss
eimss
eisss
emsss
imsss
eimsss
Once you have those combinations, split them (or don't do the join) and do look-ups in the hash my previous answer built.
What I tried so far in Perl:
use strict;
use warnings;
use Algorithm::Combinatorics qw(permutations);

die "First argument should be a dict\n" unless $ARGV[0];

open my $fh, "<", $ARGV[0] or die $!;
my @arr = <$fh>;
my $h = {};
map { chomp; $h->{lc($_)} = [] } @arr;

foreach my $word (@arr) {
    $word = lc($word);
    my $chars = [ ( $word =~ m/./g ) ];
    my $it = permutations($chars);
    while ( my $p = $it->next ) {
        my $str = join "", @$p;
        if ($str ne $word && exists $h->{$str}) {
            push @{ $h->{$word} }, $str
                unless grep { /^$str$/ } @{ $h->{$word} };
        }
    }
    if (@{ $h->{$word} }) {
        print "$word\n";
        print "\t$_\n" for @{ $h->{$word} };
    }
}

END { close $fh; }
There may be room for speed improvements, but it works.
I use the French dict from the words Arch Linux package.
EXAMPLE
$ perl annagrammes.pl /usr/share/dict/french
abaissent
absentais
abstenais
abaisser
baissera
baserais
rabaisse
(...)
NOTE
To install the Perl module:
cpan -i Algorithm::Combinatorics
h = Hash.new { |hash, key| hash[key] = [] }
array_of_words.each { |w| h[w.downcase.chars.sort].push(w) }
h.values
