Bash/Nawk whitespace problems - bash

I have 100 datafiles, each with 1000 rows, and they all look something like this:
0 0 0 0
1 0 1 0
2 0 1 -1
3 0 1 -2
4 1 1 -2
5 1 1 -3
6 1 0 -3
7 2 0 -3
8 2 0 -4
9 3 0 -4
10 4 0 -4
.
.
.
999 1 47 -21
1000 2 47 -21
I have developed a script which is supposed to take the square of each value in columns 2, 3, and 4, then sum them and take the square root.
Like so:
temp = ($t1*$t1) + ($t2*$t2) + ($t3*$t3)
calc = $calc + sqrt ($temp)
It then calculates the square of that value, and averages these numbers over every data-file to output the average "calc" for each row and average "fluc" for each row.
The meaning of these numbers is this:
The first number is the step number, the next three are coordinates on the x, y and z axis respectively. I am trying to find the distance the "steps" have taken me from the origin, this is calculated with the formula r = sqrt(x^2 + y^2 + z^2). Next I need the fluctuation of r, which is calculated as f = r^4 or f = (r^2)^2.
These must be averages over the 100 data files, which leads me to:
r = r + sqrt(x^2 + y^2 + z^2)
avg = r/s
and similarly for f where s is the number of read data files which I figure out using sum=$(ls -l *.data | wc -l).
Finally, my last calculation is the deviation between the expected r and the average r, which is calculated as stddev = sqrt(fluc - (r^2)^2) outside of the loop using final values.
The script I created is:
#!/bin/bash
sum=$(ls -l *.data | wc -l)
paste -d"\t" *.data | nawk -v s="$sum" '{
for(i=0;i<=s-1;i++)
{
t1 = 2+(i*4)
t2 = 3+(i*4)
t3 = 4+(i*4)
temp = ($t1*$t1) + ($t2*$t2) + ($t3*$t3)
calc = $calc + sqrt ($temp)
fluc = $fluc + ($calc*$calc)
}
stddev = sqrt(($calc^2) - ($fluc))
print $1" "calc/s" "fluc/s" "stddev
temp=0
calc=0
stddev=0
}'
Unfortunately, part way through I receive an error:
nawk: cmd. line:9: (FILENAME=- FNR=3) fatal: attempt to access field -1
I am not experienced enough with awk to be able to figure out exactly where I am going wrong, could someone point me in the right direction or give me a better script?
The expected output is one file with:
0 0 0 0
1 (calc for all 1's) (fluc for all 1's) (stddev for all 1's)
2 (calc for all 2's) (fluc for all 2's) (stddev for all 2's)
.
.
.

The following script should do what you want. The only thing that might not work yet is the choice of delimiters. In your original script you seem to have tabs. My solution assumes spaces. But changing that should not be a problem.
It simply pipes all files sequentially into the nawk without counting the files first. I understand that this is not required. Instead of trying to keep track of positions in the file it uses arrays to store separate statistical data for each step. In the end it iterates over all step indexes found and outputs them. Since the iteration is not sorted there is another pipe into a Unix sort call which handles this.
#!/bin/bash
# pipe the data of all files into the nawk processor
cat *.data | nawk '
BEGIN {
FS=" " # set the delimiter for the columns
}
{
step = $1 # step is in column 1
temp = $2*$2 + $3*$3 + $4*$4
# use arrays indexed by step to store data
calc[step] = calc[step] + sqrt (temp)
fluc[step] = fluc[step] + calc[step]*calc[step]
count[step] = count[step] + 1 # count the number of samples seen for a step
}
END {
# iterate over all existing steps (this is not sorted!)
for (i in count) {
stddev = sqrt((calc[i] * calc[i]) + (fluc[i] * fluc[i]))
print i" "calc[i]/count[i]" "fluc[i]/count[i]" "stddev
}
}' | sort -n -k 1 # that's why we sort here: first column "-k 1" and numerically "-n"
EDIT
As suggested by @edmorton, awk can take care of loading the files itself. The following enhanced version removes the call to cat and instead passes the file pattern as a parameter to nawk. Also, as suggested by @NictraSavios, the new version introduces special handling for the output of the statistics of the last step. Note that the gathering of the statistics is still done for all steps. It's a little difficult to suppress this during the reading of the data since at that point we don't know yet what the last step will be. Although this can be done with some extra effort you would probably lose a lot of robustness of your data handling since right now the script does not make any assumptions about:
the number of files provided,
the order of the files processed,
the number of steps in each file,
the order of the steps in a file,
the completeness of steps as a range without "holes".
Enhanced script:
#!/bin/bash
nawk '
BEGIN {
FS=" " # set the delimiter for the columns (not really required for space which is the default)
maxstep = -1
}
{
step = $1 # step is in column 1
temp = $2*$2 + $3*$3 + $4*$4
# remember maximum step for selected output
if (step > maxstep)
maxstep = step
# use arrays indexed by step to store data
calc[step] = calc[step] + sqrt (temp)
fluc[step] = fluc[step] + calc[step]*calc[step]
count[step] = count[step] + 1 # count the number of samples seen for a step
}
END {
# iterate over all existing steps (this is not sorted!)
for (i in count) {
stddev = sqrt((calc[i] * calc[i]) + (fluc[i] * fluc[i]))
if (i == maxstep)
# handle the last step in a special way
print i" "calc[i]/count[i]" "fluc[i]/count[i]" "stddev
else
# this is the normal handling
print i" "calc[i]/count[i]
}
}' *.data | sort -n -k 1 # that's why we sort here: first column "-k 1" and numerically "-n"

You could also use:
awk -f c.awk *.data
where c.awk is
{
j=FNR
temp=$2*$2+$3*$3+$4*$4
calc[j]=calc[j]+sqrt(temp)
fluc[j]=fluc[j]+calc[j]*calc[j]
}
END {
N=ARGIND
for (i=1; i<=FNR; i++) {
stdev=sqrt(fluc[i]-calc[i]*calc[i])
print i-1,calc[i]/N,fluc[i]/N,stdev
}
}
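A small portability note: ARGIND is a GNU awk extension, so N=ARGIND only works there. A hedged, portable variant of c.awk could count the files by watching FNR reset instead (a sketch, otherwise identical to the script above):
FNR==1 { N++ }                     # runs once at the start of every input file
{
    j = FNR
    temp = $2*$2 + $3*$3 + $4*$4
    calc[j] = calc[j] + sqrt(temp)
    fluc[j] = fluc[j] + calc[j]*calc[j]
}
END {
    for (i=1; i<=FNR; i++) {
        stdev = sqrt(fluc[i] - calc[i]*calc[i])
        print i-1, calc[i]/N, fluc[i]/N, stdev
    }
}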

Related

Find mean and maximum in 2nd column for a selection in 1st column

I have two columns as follows
ifile.dat
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
5 2
I would like to calculate the mean and maximum values in 2nd column for some selection in 1st column.
ofile.dat
1-2 40 15.2 #Here 1-2 means all values in 1st column ranging from 1 to 2;
#40 is the maximum of corresponding values in 2nd column and 15.2 is their mean i.e. (10+4+2+40+20)/5
3-4 50 29.8 #Here 3-4 means all values in 1st column ranging from 3 to 4;
#50 is their maximum and 29.8 is their mean i.e. (34+32+20+13+50)/5
5-6 3 2.5 #Here 5-6 means all values in 1st column ranging from 5 to 6;
#3 is their maximum and 2.5 is their mean i.e. (3+2)/2
Similarly, if I choose a selection range of 3 numbers, then the desired output will be
ofile.dat
1-3 40 19.37
4-6 50 18.7
I have the following script which calculates for single values in the 1st column. But I am looking for multiple selections from 1st column.
awk '{
if (a[$1] < $2) { a[$1]=$2 }} END { for (i in a){}}
{b[$1]+=$2; c[$1]++} END{for (i in b)
printf "%d %2s %5s %5.2f\n", i, OFS, a[i], b[i]/c[i]}' ifile.dat
The original data has the values in the 1st column varying from 1 to 100000. So I need to stratify with an interval of 1000. i.e. 1-1000, 1001-2000, 2001-3000,...
The following awk script will provide basic descriptive statistics with grouping.
I suggest looking into a more robust solution (Python, Perl, R, ...) which will support additional measures and flexibility - no point in reinventing the wheel.
The grouping logic is 1-1000, 1001-2000, ..., as per the comment above. The code is verbose for clarity.
awk '
{
# Total Counter
nn++ ;
# Group id
gsize = 1000
gid = int(($1-1)/gsize )
v = $2
# Setup new group, if needed
if ( !n[gid] ) {
n[gid] = 0
sum[gid] = 0
max[gid] = min[gid] = v
name[gid] = (gid*gsize +1) "-" ((gid+1)*gsize)
}
if ( v > max[gid] ) max[gid] = v
sum[gid] += v
n[gid]++
}
END {
# Print all groups
for (gid in name) {
printf "%-20s %4d %6.1f %5.1F\n", name[gid], max[gid], sum[gid]/n[gid], n[gid]/nn ;
}
}
'
Could you please try the following; it is tested and written with the shown samples only.
sort -k1 Input_file |
awk -v range="1" '
!b[$1]++{
c[++count]=$1
}
{
a[$1]=a[$1]>$2?a[$1]:$2
d[$1]+=$2
e[$1]++
till=$1
}
END{
for(i=1;i<=till;i+=(range+1)){
for(j=i;j<=i+range;j++){
max=max>a[c[j]]?max:a[c[j]]
total+=d[c[j]]
occr+=e[c[j]]
}
print i"-"i+range,max,occr?total/occr:0
occr=total=max=""
}
}
'
For the shown samples the output will be as follows.
1-2 40 15.2
3-4 50 29.8
5-6 3 2.5
I have kept the range variable as 1 since the difference between the groups shown here is 2; in your case, where the 1st column runs 1, 1001, and so on, set the range variable to 999 instead.
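For that 1-1000, 1001-2000, ... grouping, only the range value changes; a sketch of the invocation, assuming the awk program between the quotes above has been saved to a file hypothetically named group.awk:
# numeric sort matters once the 1st column goes beyond single digits
sort -n -k1 Input_file | awk -v range="999" -f group.awk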

Find the closest values: Multiple columns conditions

Following my first question here, I want to extend the condition for finding the closest value from two different files in the first and second columns, and print specific columns.
File1
1 2 3 4 a1
1 4 5 6 b1
8 5 9 11 c1
File 2
1 1 3 a
1 2 5 b
1 2.1 4 c
1 4 6 d
2 4 5 e
9 4 1 f
9 5 2 g
9 6 2 h
11 10 14 i
11 15 5 j
So, for example, I need to find the closest value of $1 in file 2 for each $1 in file 1, but then also search for the closest $2.
Output:
1 2 a1*
1 2 b*
1 4 b1
1 4 d
8 5 c1
9 5 g
* First line from file 1 and 2nd line from file 2, because for the 1st column (of file 1) the closest value (from the 1st column of file 2) is 1, and the 2nd condition is that it must also be the closest value for the second column, which in this case is 2. And I print $1,$2,$5 from file 1 and $1,$2,$4 from file 2
For the other output is the same procedure.
The solution to find the closest is in my other post and was given by @Tensibai.
But any solution will work.
Thanks!
Sounds a little convoluted but works:
function closest(array,searched) {
distance=999999; # this should be higher than the max index to avoid returning null
split(searched,skeys,OFS)
# Get the first part of key
for (x in array) { # loop over the array to get its keys
split(x,mkeys,OFS) # split the array key
(mkeys[1]+0 > skeys[1]+0) ? tmp = mkeys[1] - skeys[1] : tmp = skeys[1] - mkeys[1] # +0 to compare integers, ternary operator to reduce code, compute the diff between the key and the target
if (tmp < distance) { # if the distance is less than the preceding one, update
distance = tmp
found1 = mkeys[1] # and save the key actually found closest
}
}
# At this point we have the first part of key found, let's redo the work for the second part
distance=999999;
for (x in array) {
split(x,mkeys,OFS)
if (mkeys[1] == found1) { # Filter on the first part of key
(mkeys[2]+0 > skeys[2]+0) ? tmp = mkeys[2] - skeys[2] : tmp = skeys[2] - mkeys[2] # +0 to compare integers, ternary operator to reduce code, compute the diff between the key and the target
if (tmp < distance) { # if the distance is less than the preceding one, update
distance = tmp
found2 = mkeys[2] # and save the key actually found closest
}
}
}
# Now we got the second field, woot
return (found1 OFS found2) # return the combined key from our two searches
}
{
if (NR>FNR) { # If we changed file (File Number Record is less than Number Record) change array
b[($1 OFS $2)] = $4 # make a array with "$1 $2" as key and $4 as value
} else {
key = ($1 OFS $2) # Make the key to avoid too much computation accessing it later
akeys[max++] = key # store the array keys to ensure order at end as for (x in array) does not guarantee the order
a[key] = $5 # make an array with the key stored previously and $5 as value
}
}
END { # Now we ended parsing the two files, print the result
for (i in akeys) { # loop over the array of keys which has a numeric index, keeping order
print akeys[i],a[akeys[i]] # print the value for the first array (key then value)
if (akeys[i] in b) { # if the same key exist in second file
print akeys[i],b[akeys[i]] # then print it
} else {
bindex = closest(b,akeys[i]) # call the function to find the closest key from second file
print bindex,b[bindex] # print what we found
}
}
}
Note I'm using OFS to combine the fields so if you change it for output it will behave properly.
WARNING: This should do for relatively short files, but since the array from the second file is now traversed twice, each search will take twice as long. END OF WARNING
There's room for a better search algorithm if your files are sorted (but that was not the case in the previous question, and you wished to keep the order from the file). A first improvement in this case: break the for loop when the distance starts to become greater than the preceding one (see the sketch below the sample output).
Output from your sample files:
$ mawk -f closest2.awk f1 f2
1 2 a1
1 2 b
1 4 b1
1 4 d
8 5 c1
9 5 g
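A minimal sketch of that early-exit idea, assuming the keys have already been copied into a numerically sorted array keys[1..n] (this helper is not part of the answer above):
function closest_sorted(keys, n, target,    i, d, best, bestd) {
    bestd = -1
    for (i = 1; i <= n; i++) {
        d = keys[i] > target ? keys[i] - target : target - keys[i]
        if (bestd < 0 || d < bestd) { bestd = d; best = keys[i] }
        else break    # keys are sorted, so the distance can only grow from here
    }
    return best
}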

Variable format

I wrote a program to calculate a square finite difference matrix, where you can enter the number of rows (which equals the number of columns); this is stored in the variable matrix. The program works fine:
program fin_diff_matrix
implicit none
integer, dimension(:,:), allocatable :: A
integer :: matrix,i,j
print *,'Enter elements:'
read *, matrix
allocate(A(matrix,matrix))
A = 0
A(1,1) = 2
A(1,2) = -1
A(matrix,matrix) = 2
A(matrix,matrix-1) = -1
do j=2,matrix-1
A(j,j-1) = -1
A(j,j) = 2
A(j,j+1) = -1
end do
print *, 'Matrix A: '
write(*,1) A
1 format(6i10)
end program fin_diff_matrix
For the output I want the matrix to be formatted accordingly, e.g. if the user enters 6 rows the output should look like:
2 -1 0 0 0 0
-1 2 -1 0 0 0
0 -1 2 -1 0 0
0 0 -1 2 -1 0
0 0 0 -1 2 -1
0 0 0 0 -1 2
The output of the format should also be variable, e.g. if the user enters 10, the output should also be formatted in 10 columns. Research on the Internet gave the following solution for the format statement with angle brackets:
1 format(<matrix>i10)
If I compile with gfortran in Linux I always get the following error in the terminal:
fin_diff_matrix.f95:37.12:
1 format(<matrix>i10)
1
Error: Unexpected element '<' in format string at (1)
fin_diff_matrix.f95:35.11:
write(*,1) A
1
Error: FORMAT label 1 at (1) not defined
Why doesn't that work, and what is my mistake?
The syntax you are trying to use is non-standard; it works only in some compilers, and I discourage using it.
Also, forget the FORMAT() statements for good, they are obsolete.
You can get your own number inside the format string by constructing the string yourself from several parts:
character(80) :: form
form = '(          (i10,1x))'
write(form(2:11),'(i10)') matrix
write(*,form) A
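A slightly shorter way to build the same string, assuming the same matrix variable, is a single internal write with the i0 edit descriptor (just an alternative sketch):
character(80) :: form
write(form,'(a,i0,a)') '(', matrix, '(i10,1x))'   ! e.g. produces (6(i10,1x))
write(*,form) A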
You can also write your matrix in a loop per row and then you can use an arbitrarily large count number or a * in Fortran 2008.
do i = 1, matrix
write(*,'(999(i10,1x))') A(:,i)
end do
do i = 1, matrix
write(*,'(*(i10,1x))') A(:,i)
end do
Just check if I did not transpose the matrix inadvertently.

Why is this perl script so slow?

I have written a perl script to implement in the polymake framework.
The script accomplishes the following task. Its input is a matrix $M from polymake. Issuing
print ref($M);
returns
Polymake::common::Matrix_A_Rational_I_NonSymmetric_Z
The output of the script is a matrix whose rows are the rows obtained from M by taking all the sums of all subsets of rows of M. For example, the input of
1 0 0
1 3 0
1 0 3
returns
0 0 0
3 3 3
2 3 3
2 0 3
1 0 3
2 3 0
1 3 0
1 0 0
The code for my script is
use application 'polytope';
use Data::PowerSet 'powerset';
sub newVertices {
my ($orig) = @_;
my $n = $orig -> rows;
my $vert = zero_vector($n);
my $d = Data::PowerSet->new( {min => 1}, map {"$_"} (0..$n-1) );
while (my $r = $d -> next) {
my $v = new Vector($n);
foreach (@$r) {
$v += @$orig[$_];
}
$vert = $vert / $v;
}
return $vert;
}
The script works fine. My issue is that the script runs a lot slower than I think it should. The example I mention above takes about half a minute to run. My questions are
Am I correct to expect that such a task should be able to be accomplished quickly for matrices with relatively few rows?
Why is this script so slow?
Is there a way to accomplish this task more efficiently?
I suspect the problem might be the matrix object type I am dealing with, and maybe writing a script that does this purely with arrays might be more efficient. My problem is that I cannot figure out how to transition from an array back to a matrix, so I haven't made an attempt to do something along these lines.
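For what it's worth, the subset-sum step itself is cheap with plain Perl arrays; a rough sketch that deliberately ignores the polymake Matrix type (so it does not answer the array-to-matrix conversion part):
use strict;
use warnings;

# Sum every subset of rows, using plain Perl arrays only.
sub subset_sums {
    my @rows = @_;                      # each row is an array ref
    my $cols = @{ $rows[0] };
    my @sums;
    for my $mask (0 .. (1 << @rows) - 1) {
        my @sum = (0) x $cols;
        for my $i (0 .. $#rows) {
            next unless $mask & (1 << $i);
            $sum[$_] += $rows[$i][$_] for 0 .. $cols - 1;
        }
        push @sums, \@sum;
    }
    return @sums;
}

# Example: the 3x3 matrix from the question.
print "@$_\n" for subset_sums([1,0,0], [1,3,0], [1,0,3]);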

Code Golf: Quickly Build List of Keywords from Text, Including # of Instances

I've already worked out this solution for myself with PHP, but I'm curious how it could be done differently - better even. The two languages I'm primarily interested in are PHP and Javascript, but I'd be interested in seeing how quickly this could be done in any other major language today as well (mostly C#, Java, etc).
Return only words with an occurrence greater than X
Return only words with a length greater than Y
Ignore common terms like "and, is, the, etc"
Feel free to strip punctuation prior to processing (ie. "John's" becomes "John")
Return results in a collection/array
Extra Credit
Keep quoted statements together (i.e. "They were 'too good to be true' apparently"), where 'too good to be true' would be the actual statement
Extra-Extra Credit
Can your script determine words that should be kept together based upon their frequency of being found together? This being done without knowing the words beforehand. Example:
*"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has lead to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*
Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?
Source text: http://sampsonresume.com/labs/c.txt
Answer Format
It would be great to see the results of your code, output, in addition to how long the operation lasted.
GNU scripting
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr
Results:
7 be
6 to
[...]
1 2.
1 -
With occurrence greater than X:
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'
Return only words with a length greater than Y (put Y+1 dots in second grep):
sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c
Ignore common terms like "and, is, the, etc" (assuming that the common terms are in file 'ignored')
sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c
Feel free to strip punctuation prior to processing (ie. "John's" becomes "John"):
sed -e 's/[,.:"\']//g;s/ /\n/g' | grep -v '^ *$' | sort | uniq -c
Return results in a collection/array: it is already like an array for the shell: the first column is the count, the second is the word.
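If an actual shell array is wanted, the pairs can be loaded into a bash (4+) associative array; a sketch, assuming the source text is in c.txt:
declare -A freq
while read -r count word; do
    freq[$word]=$count
done < <(sed -e 's/ /\n/g' c.txt | grep -v '^ *$' | sort | uniq -c)
echo "${freq[statement]}"    # look up one word's count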
Perl in only 43 characters.
perl -MYAML -anE'$_{$_}++for@F;say Dump\%_'
Here is an example of its use:
echo a a a b b c d e aa | perl -MYAML -anE'$_{$_}++for@F;say Dump \%_'
---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1
If you need to list only the lowercase versions, it requires two more characters.
perl -MYAML -anE'$_{lc$_}++for@F;say Dump\%_'
For it to work on the specified text requires 58 characters.
curl http://sampsonresume.com/labs/c.txt |
perl -MYAML -F'\W+' -anE'$_{lc$_}++for@F;END{say Dump\%_}'
real 0m0.679s
user 0m0.304s
sys 0m0.084s
Here is the last example expanded a bit.
#! perl
use 5.010;
use YAML;
while( my $line = <> ){
for my $elem ( split '\W+', $line ){
$_{ lc $elem }++
}
END{
say Dump \%_;
}
}
F#: 304 chars
let f =
let bad = Set.of_seq ["and";"is";"the";"of";"are";"by";"it"]
fun length occurrence msg ->
System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
|> Seq.countBy (fun a -> a)
|> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)
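A usage sketch for the function above, assuming the source text has been saved locally as c.txt (minimum length 2, minimum occurrence 3):
f 2 3 (System.IO.File.ReadAllText "c.txt")
|> Seq.iter (printfn "%s")   // print each surviving keyword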
Ruby
When "minified", this implementation becomes 165 characters long. It uses array#inject to give a starting value (a Hash object with a default of 0) and then loop through the elements, which are then rolled into the hash; the result is then selected from the minimum frequency.
Note that I didn't count the size of the words to skip, that being an external constant. When the constant is counted too, the solution is 244 characters long.
Apostrophes and dashes aren't stripped, but included; their use modifies the word and therefore cannot be stripped simply without removal of all information beyond the symbol.
Implementation
CommonWords = %w(the a an but and is not or as of to in for by be may has can its it's)
def get_keywords(text, minFreq=0, minLen=2)
text.scan(/(?:\b)[a-z'-]{#{minLen},}(?=\b)/i).
inject(Hash.new(0)) do |result,w|
w.downcase!
result[w] += 1 unless CommonWords.include?(w)
result
end.select { |k,n| n >= minFreq }
end
Test Rig
require 'net/http'
keywords = get_keywords(Net::HTTP.get('www.sampsonresume.com','/labs/c.txt'), 3)
keywords.sort.each { |name,count| puts "#{name} x #{count} times" }
Test Results
code x 4 times
declarations x 4 times
each x 3 times
execution x 3 times
expression x 4 times
function x 5 times
keywords x 3 times
language x 3 times
languages x 3 times
new x 3 times
operators x 4 times
programming x 3 times
statement x 7 times
statements x 4 times
such x 3 times
types x 3 times
variables x 3 times
which x 4 times
C# 3.0 (with LINQ)
Here's my solution. It makes use of some pretty nice features of LINQ/extension methods to keep the code short.
public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
{
var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
"for", "by", "an", "be", "may", "has", "can", "its"};
var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');
var occurrences = words.Distinct().Except(commonWords).Select(w =>
new { Word = w, Count = words.Count(s => s == w) });
return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
.ToDictionary(wo => wo.Word, wo => wo.Count);
}
This is however far from the most efficient method, being O(n^2) with the number of words, rather than O(n), which is optimal in this case I believe. I'll see if I can create a slightly longer method that is more efficient (see the sketch after the test program below).
Here are the results of the function run on the sample text (min occurrences: 3, min length: 2).
3 x such
4 x code
4 x which
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
7 x statement
3 x language
3 x expression
3 x execution
3 x programming
4 x operators
3 x variables
And my test program:
static void Main(string[] args)
{
string sampleText;
using (var client = new WebClient())
sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
var keywords = GetKeywords(sampleText, 3, 2);
foreach (var entry in keywords)
Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
Console.ReadKey(true);
}
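Regarding the O(n^2) note above: the counting step could be made a single pass with GroupBy. A sketch reusing the words and commonWords variables from GetKeywords:
// count each word in one pass instead of re-scanning the whole list per word
var occurrences = words
    .Where(w => !commonWords.Contains(w))
    .GroupBy(w => w)
    .Select(g => new { Word = g.Key, Count = g.Count() });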
#! perl
use strict;
use warnings;
my %words;
while (<>) {
for my $word (split) {
$words{$word}++;
}
}
for my $word (keys %words) {
print "$word occurred $words{$word} times.";
}
That's the simple form. If you want sorting, filtering, etc.:
while (<>) {
for my $word (split) {
$words{$word}++;
}
}
for my $word (keys %words) {
if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
print "$word occurred $words{$word} times.";
}
}
You can also sort the output pretty easily:
...
for my $word (keys %words) {
if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
push @output, "$word occurred $words{$word} times.";
}
}
$re = qr/occurred (\d+) /;
print sort {
$a = $a =~ $re;
$b = $b =~ $re;
$a <=> $b
} @output;
A true Perl hacker will easily get these on one or two lines each, but I went for readability.
Edit: this is how I would rewrite this last example
...
for my $word (
sort { $words{$a} <=> $words{$b} } keys %words
){
next unless length($word) >= $MINLEN;
last unless $words{$word} >= $MIN_OCCURRENCE;
print "$word occurred $words{$word} times.";
}
Or if I needed it to run faster I might even write it like this:
for my $word_data (
sort {
$a->[1] <=> $b->[1] # numerical sort on count
} grep {
# remove values that are out of bounds
length($_->[0]) >= $MINLEN && # word length
$_->[1] >= $MIN_OCCURRENCE # count
} map {
# [ word, count ]
[ $_, $words{$_} ]
} keys %words
){
my( $word, $count ) = @$word_data;
print "$word occurred $count times.";
}
It uses map for efficiency,
grep to remove extra elements,
and sort to do the sorting, of course. (It does so in that order.)
This is a slight variant of the Schwartzian transform.
Another Python solution, at 247 chars. The actual code is a single, highly dense line of Python, 134 chars, that computes the whole thing in a single expression.
x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
from itertools import groupby as gb
d=dict((w,l)for w,l in((w,len(list(g)))for w,g in
gb(sorted(open("c.txt").read().lower().split())))
if l>x and len(w)>y and w not in W)
A much longer version with plenty of comments for your reading pleasure:
# High and low count boundaries.
x = 3
y = 2
# Common words string split into a list by spaces.
Words = "and is the as of to or in for by an be may has can its".split()
# A special function that groups similar strings in a list into a
# (string, grouper) pairs. Grouper is a generator of occurences (see below).
from itertools import groupby
# Reads the entire file, converts it to lower case and splits on whitespace
# to create a list of words
sortedWords = sorted(open("c.txt").read().lower().split())
# Using the groupby function, groups similar words together.
# Since grouper is a generator of occurences we need to use len(list(grouper))
# to get the word count by first converting the generator to a list and then
# getting the length of the list.
wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))
# Filters the words by number of occurences and common words using yet another
# list comprehension.
filteredWordCounts = ((word, count) for word, count in wordCounts if word not in Words and count > x and len(word) > y)
# Creates a dictionary from the list of tuples.
result = dict(filteredWordCounts)
print result
The main trick here is using the itertools.groupby function to count the occurrences on a sorted list. Don't know if it really saves characters, but it does allow all the processing to happen in a single expression.
Results:
{'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}
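For comparison, collections.Counter from the standard library does the same counting without the sort/groupby pairing; a plain, non-golfed sketch assuming the same c.txt and thresholds:
from collections import Counter

W = "and is the as of to or in for by an be may has can its".split()
x, y = 3, 2
counts = Counter(open("c.txt").read().lower().split())
print({w: n for w, n in counts.items() if n > x and len(w) > y and w not in W})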
C# code:
IEnumerable<KeyValuePair<String, Int32>> ProcessText(String text, int X, int Y)
{
// common words, that will be ignored
var exclude = new string[] { "and", "is", "the", "as", "of", "to", "or", "in", "for", "by", "an", "be", "may", "has", "can", "its" }.ToDictionary(word => word);
// regular expression to find quoted text
var regex = new Regex("\"[^\"]\"", RegexOptions.Compiled);
return
// remove quoted text (it will be processed later)
regex.Replace(text, "")
// remove case dependency
.ToLower()
// split text by all these chars
.Split(".,'\\/[]{}()`~##$%^&*-=+?!;:<>| \n\r".ToCharArray())
// add quoted text
.Concat(regex.Matches(text).Cast<Match>().Select(match => match.Value))
// group words by the word and count them
.GroupBy(word => word, (word, words) => new KeyValuePair<String, Int32>(word, words.Count()))
// apply filter(min word count and word length) and remove common words
.Where(pair => pair.Value >= X && pair.Key.Length >= Y && !exclude.ContainsKey(pair.Key));
}
Output for ProcessText(text, 3, 2) call:
3 x languages
3 x such
4 x code
4 x which
3 x based
3 x each
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
3 x variables
7 x statement
4 x expression
3 x execution
3 x programming
3 x operators
In C#:
Use LINQ, specifically groupby, then filter by group count, and return a flattened (selectmany) list.
Use LINQ, filter by length.
Use LINQ, filter with 'badwords'.Contains.
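A sketch of those steps; the helper name and parameters are illustrative, and the group key is returned instead of a SelectMany flattening since one entry per keyword is what the task asks for:
using System.Collections.Generic;
using System.Linq;

static class KeywordSketch
{
    public static IEnumerable<string> Keywords(
        IEnumerable<string> words, string[] badWords, int minCount, int minLength) =>
        words.GroupBy(w => w)                       // group identical words
             .Where(g => g.Count() > minCount)      // occurrence greater than X
             .Select(g => g.Key)
             .Where(w => w.Length > minLength)      // length greater than Y
             .Where(w => !badWords.Contains(w));    // drop common terms
}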
REBOL
Verbose, perhaps, so definitely not a winner, but gets the job done.
min-length: 0
min-count: 0
common-words: [ "a" "an" "as" "and" "are" "by" "for" "from" "in" "is" "it" "its" "the" "of" "or" "to" "until" ]
add-word: func [
word [string!]
/local
count
letter
non-letter
temp
rules
match
][
; Strip out punctuation
temp: copy {}
letter: charset [ #"a" - #"z" #"A" - #"Z" #" " ]
non-letter: complement letter
rules: [
some [
copy match letter (append temp match)
|
non-letter
]
]
parse/all word rules
word: temp
; If we end up with nothing, bail
if 0 == length? word [
exit
]
; Check length
if min-length > length? word [
exit
]
; Ignore common words
ignore:
if find common-words word [
exit
]
; OK, its good. Add it.
either found? count: select words word [
words/(word): count + 1
][
repend words [word 1]
]
]
rules: [
some [
{"}
copy word to {"} (add-word word)
{"}
|
copy word to { } (add-word word)
{ }
]
end
]
words: copy []
parse/all read %c.txt rules
result: copy []
foreach word words [
if string? word [
count: words/:word
if count >= min-count [
append result word
]
]
]
sort result
foreach word result [ print word ]
The output is:
act
actions
all
allows
also
any
appear
arbitrary
arguments
assign
assigned
based
be
because
been
before
below
between
braces
branches
break
builtin
but
C
C like any other language has its blemishes Some of the operators have the wrong precedence some parts of the syntax could be better
call
called
calls
can
care
case
char
code
columnbased
comma
Comments
common
compiler
conditional
consisting
contain
contains
continue
control
controlflow
criticized
Cs
curly brackets
declarations
define
definitions
degree
delimiters
designated
directly
dowhile
each
effect
effects
either
enclosed
enclosing
end
entry
enum
evaluated
evaluation
evaluations
even
example
executed
execution
exert
expression
expressionExpressions
expressions
familiarity
file
followed
following
format
FORTRAN
freeform
function
functions
goto
has
high
However
identified
ifelse
imperative
include
including
initialization
innermost
int
integer
interleaved
Introduction
iterative
Kernighan
keywords
label
language
languages
languagesAlthough
leave
limit
lineEach
loop
looping
many
may
mimicked
modify
more
most
name
needed
new
next
nonstructured
normal
object
obtain
occur
often
omitted
on
operands
operator
operators
optimization
order
other
perhaps
permits
points
programmers
programming
provides
rather
reinitialization
reliable
requires
reserve
reserved
restrictions
results
return
Ritchie
say
scope
Sections
see
selects
semicolon
separate
sequence
sequence point
sequential
several
side
single
skip
sometimes
source
specify
statement
statements
storage
struct
Structured
structuresAs
such
supported
switch
syntax
testing
textlinebased
than
There
This
turn
type
types
union
Unlike
unspecified
use
used
uses
using
usually
value
values
variable
variables
variety
which
while
whitespace
widespread
will
within
writing
Python (258 chars as is, including 66 chars for first line and 30 chars for punctuation removal) :
W="and is the as of to or in for by an be may has can its".split()
x=3;y=2;d={}
for l in open('c.txt') :
for w in l.lower().translate(None,',.;\'"!()[]{}').split() :
if w not in W: d[w]=d.get(w,0)+1
for w,n in d.items() :
if n>y and len(w)>x : print n,w
output :
4 code
3 keywords
3 languages
3 execution
3 each
3 language
4 expression
4 statements
3 variables
7 statement
5 function
4 operators
4 declarations
3 programming
4 which
3 such
3 types
Here is my variant, in PHP:
$str = implode(file('c.txt'));
$tok = strtok($str, " .,;()\r\n\t");
$splitters = '\s.,\(\);?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );
foreach($array as $key) {
$res[$key] = $res[$key]+1;
}
$splitters = '\s.,\(\)\{\};?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );
foreach($array as $key) {
$res[$key] = $res[$key]+1;
}
unset($res['the']);
unset($res['and']);
unset($res['to']);
unset($res['of']);
unset($res['by']);
unset($res['a']);
unset($res['as']);
unset($res['is']);
unset($res['in']);
unset($res['']);
arsort($res);
//var_dump($res); // concordance
foreach ($res AS $word => $rarity)
echo $word . ' <b>x</b> ' . $rarity . '<br/>';
foreach ($array as $word) { // words longer than n (=5)
// if(strlen($word) > 5)echo $word.'<br/>';
}
And output:
statement x 7
be x 7
C x 5
may x 5
for x 5
or x 5
The x 5
as x 5
expression x 4
statements x 4
code x 4
function x 4
which x 4
an x 4
declarations x 3
new x 3
execution x 3
types x 3
such x 3
variables x 3
can x 3
languages x 3
operators x 3
end x 2
programming x 2
evaluated x 2
functions x 2
definitions x 2
keywords x 2
followed x 2
contain x 2
several x 2
side x 2
most x 2
has x 2
its x 2
called x 2
specify x 2
reinitialization x 2
use x 2
either x 2
each x 2
all x 2
built-in x 2
source x 2
are x 2
storage x 2
than x 2
effects x 1
including x 1
arguments x 1
order x 1
even x 1
unspecified x 1
evaluations x 1
operands x 1
interleaved x 1
However x 1
value x 1
branches x 1
goto x 1
directly x 1
designated x 1
label x 1
non-structured x 1
also x 1
enclosing x 1
innermost x 1
loop x 1
skip x 1
There x 1
within x 1
switch x 1
Expressions x 1
integer x 1
variety x 1
see x 1
below x 1
will x 1
on x 1
selects x 1
case x 1
executed x 1
based x 1
calls x 1
from x 1
because x 1
many x 1
widespread x 1
familiarity x 1
C's x 1
mimicked x 1
Although x 1
reliable x 1
obtain x 1
results x 1
needed x 1
other x 1
syntax x 1
often x 1
Introduction x 1
say x 1
Programming x 1
Language x 1
C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better. x 1
Ritchie x 1
Kernighan x 1
been x 1
criticized x 1
For x 1
example x 1
care x 1
more x 1
leave x 1
return x 1
call x 1
&& x 1
|| x 1
entry x 1
include x 1
next x 1
before x 1
sequence point x 1
sequence x 1
points x 1
comma x 1
operator x 1
but x 1
compiler x 1
requires x 1
programmers x 1
exert x 1
optimization x 1
object x 1
This x 1
permits x 1
high x 1
degree x 1
occur x 1
Structured x 1
using x 1
struct x 1
union x 1
enum x 1
define x 1
Declarations x 1
file x 1
contains x 1
Function x 1
turn x 1
assign x 1
perhaps x 1
Keywords x 1
char x 1
int x 1
Sections x 1
name x 1
variable x 1
reserve x 1
usually x 1
writing x 1
type x 1
Each x 1
line x 1
format x 1
rather x 1
column-based x 1
text-line-based x 1
whitespace x 1
arbitrary x 1
FORTRAN x 1
77 x 1
free-form x 1
allows x 1
restrictions x 1
Comments x 1
C99 x 1
following x 1
// x 1
until x 1
*/ x 1
/* x 1
appear x 1
between x 1
delimiters x 1
enclosed x 1
braces x 1
supported x 1
if x 1
-else x 1
conditional x 1
Unlike x 1
reserved x 1
sequential x 1
provides x 1
control-flow x 1
identified x 1
do-while x 1
while x 1
any x 1
omitted x 1
break x 1
continue x 1
expressions x 1
testing x 1
iterative x 1
looping x 1
separate x 1
initialization x 1
normal x 1
modify x 1
control x 1
structures x 1
As x 1
imperative x 1
single x 1
act x 1
sometimes x 1
curly brackets x 1
limit x 1
scope x 1
language x 1
uses x 1
evaluation x 1
assigned x 1
values x 1
To x 1
effect x 1
semicolon x 1
actions x 1
common x 1
consisting x 1
used x 1
The var_dump statement simply displays the concordance. This variant preserves double-quoted expressions.
For the supplied file this code finishes in 0.047 seconds, though a larger file will consume lots of memory (because of the file() function).
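If memory becomes a concern on larger inputs, the counting can be done streaming with fgets instead of loading everything through file(); a rough sketch (PHP 7+), without the quoted-phrase handling:
<?php
$res = [];
$fh = fopen('c.txt', 'r');
while (($line = fgets($fh)) !== false) {
    foreach (preg_split('/[\s.,;(){}?:]+/', $line, -1, PREG_SPLIT_NO_EMPTY) as $w) {
        $res[$w] = ($res[$w] ?? 0) + 1;   // increment this word's count
    }
}
fclose($fh);
arsort($res);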
This is not going to win any golfing awards but it does keep quoted phrases together and takes into account stop words (and leverages CPAN modules Lingua::StopWords and Text::ParseWords).
In addition, I use to_S from Lingua::EN::Inflect::Number to count only the singular forms of words.
You might also want to look at Lingua::CollinsParser.
#!/usr/bin/perl
use strict; use warnings;
use Lingua::EN::Inflect::Number qw( to_S );
use Lingua::StopWords qw( getStopWords );
use Text::ParseWords;
my $stop = getStopWords('en');
my %words;
while ( my $line = <> ) {
chomp $line;
next unless $line =~ /\S/;
next unless my @words = parse_line(' ', 1, $line);
++ $words{to_S $_} for
grep { length and not $stop->{$_} }
map { s!^[[:punct:]]+!!; s![[:punct:]]+\z!!; lc }
@words;
}
print "=== only words appearing 4 or more times ===\n";
print "$_ : $words{$_}\n" for sort {
$words{$b} <=> $words{$a}
} grep { $words{$_} > 3 } keys %words;
print "=== only words that are 12 characters or longer ===\n";
print "$_ : $words{$_}\n" for sort {
$words{$b} <=> $words{$a}
} grep { 11 < length } keys %words;
Output:
=== only words appearing 4 or more times ===
statement : 11
function : 7
expression : 6
may : 5
code : 4
variable : 4
operator : 4
declaration : 4
c : 4
type : 4
=== only words that are 12 characters or longer ===
reinitialization : 2
control-flow : 1
sequence point : 1
optimization : 1
curly brackets : 1
text-line-based : 1
non-structured : 1
column-based : 1
initialization : 1
