How to refactor this in J? - refactoring

Here is a different approach for the Project Euler #1 solution:
+/~.(3*i.>.1000%3),5*i.>.1000%5
How to refactor it?

[:+/#~.#,3 5([*i.#>.#%~)]
usage example:
f =: [:+/#~.#,3 5([*i.#>.#%~)]
f 1000
or
+/~.,3 5([*i.#>.#%~)1000
%~ = 4 : 'y % x'
i.#>.#%~ = 4 : 'i. >. y % x'
[*i.#>.#%~ = 4 : 'x * i. >. y % x'
3 5([*i.#>.#%~)] = 3 : '3 5 * i. >. y % 3 5'
[:+/#~.#,3 5([*i.#>.#%~)] = 3 : '+/ ~. , 3 5 * i. >. y % 3 5'

+/(#~ ( (0= 3| ]) +. (0 = 5 |]) )) 1+i.999
0 = ( 3 | ]) uses (twice) the trick of verb train (fork) with n u v (discussed at the end of http://www.jsoftware.com/help/learning/09.htm)
A different way of writing it:
+/(#~ ( ((0&=) # (3&|)) +. ((0&=) # (5&|)))) 1+i.999

Here is another approach, using a simple, generic verb
multiplesbelow =: 4 : 'I. 0 = x | i.y'
+/ ~. ,3 5 multiplesbelow"0 [ 1000

Related

Finding natural numbers having n Trailing Zeroes in Factorial

I need help with the following problem.
Given an integer m, I need to find the number of positive integers n and the integers, such that the factorial of n ends with exactly m zeroes.
I wrote this code it works fine and i get the right output, but it take way too much time as the numbers increase.
a = input()
while a:
x = []
m, n, fact, c, j = input(), 0, 1, 0, 0
z = 10*m
t = 10**m
while z - 1:
fact = 1
n = n + 1
for i in range(1, n + 1):
fact = fact * i
if fact % t == 0 and ((fact / t) % 10) != 0:
x.append(int(n))
c = c + 1
z = z - 1
for p in range(c):
print x[p],
a -= 1
print c
Could someone suggest me a more efficient way to do this. Presently, it takes 30 seconds for a test case asking for numbers with 250 trailing zeros in its factorial.
Thanks
To get number of trailing zeroes of n! efficiently you can put
def zeroes(value):
result = 0;
d = 5;
while (d <= value):
result += value // d; # integer division
d *= 5;
return result;
...
# 305: 1234! has exactly 305 trailing zeroes
print zeroes(1234)
In order to solve the problem (what numbers have n trailing zeroes in n!) you can use these facts:
number of zeroes is a monotonous function: f(x + a) >= f(x) if a >= 0.
if f(x) = y then x <= y * 5 (we count only 5 factors).
if f(x) = y then x >= y * 4 (let me leave this for you to prove)
Then implement binary search (on monotonous function).
E.g. in case of 250 zeroes we have the initial range to test [4*250..5*250] == [1000..1250]. Binary search narrows the range down into [1005..1009].
1005, 1006, 1007, 1008, 1009 are all numbers such that they have exactly 250 trainling zeroes in factorial
Edit I hope I don't spoil the fun if I (after 2 years) prove the last conjecture (see comments below):
Each 5**n within facrtorial when multiplied by 2**n produces 10**n and thus n zeroes; that's why f(x) is
f(x) = [x / 5] + [x / 25] + [x / 125] + ... + [x / 5**n] + ...
where [...] stands for floor or integer part (e.g. [3.1415926] == 3). Let's perform easy manipulations:
f(x) = [x / 5] + [x / 25] + [x / 125] + ... + [x / 5**n] + ... <= # removing [...]
x / 5 + x / 25 + x / 125 + ... + x / 5**n + ... =
x * (1/5 + 1/25 + 1/125 + ... + 1/5**n + ...) =
x * (1/5 * 1/(1 - 1/5)) =
x * 1/5 * 5/4 =
x / 4
So far so good
f(x) <= x / 4
Or if y = f(x) then x >= 4 * y Q.E.D.
Focus on the number of 2s and 5s that makes up a number. e.g. 150 is made up of 2*3*5*5, there 1 pair of 2&5 so there's one trailing zero. Each time you increase the tested number, try figuring out how much 2 and 5s are in the number. From that, adding up previous results you can easily know how much zeros its factorial contains.
For example, 15!=15*...*5*4*3*2*1, starting from 2:
Number 2s 5s trailing zeros of factorial
2 1 0 0
3 1 0 0
4 2 0 0
5 2 1 1
6 3 1 1
...
10 5 2 2
...
15 7 3 3
..
24 12 6 6
25 12 8 8 <- 25 counts for two 5-s: 25 == 5 * 5 == 5**2
26 13 8 8
..
Refer to Peter de Rivaz's and Dmitry Bychenko's comments, they have got some good advices.

Haskell: How to change algorithm to work on any size of list?

I have this code:
project= [
[(a,b),(c,d),(e,f)]
|
a<-[1..5],
b<-[1..3],
c<-[1..5],
d<-[1..3],
e<-[1..5],
f<-[1..3]
, a*b + c*d + e*f <6
, a + c + e == 5
, b == 3 || d==3 || f==3
]
x=take 1 project
main = print $ x
it is return a list of 3 pairs [(x,y),(x,y),(x,y)] .
There are 3 conditions:
If you sum all the x you must get 5.
If you sum all the x*y you will get less than 6.
There is at least one y that equal to 3.
Now, I want exactly the same algorithm to work for any longer list for example 10 pairs. How should I do that?
Here:
project n =
[ x
| x <- replicateM n $ liftA2 (,) [1..5] [1..3]
, sum (map (uncurry (*)) x) < 6
, sum (map fst x) == 5
, any ((==3) . snd) x
]
main = print $ take 1 $ project 3
Or like so:
project n
= filter (any ((==3) . snd))
$ filter ((==5) . sum . map fst)
$ filter ((<6) . sum . map (uncurry (*)))
$ replicateM n
$ liftA2 (,) [1..5] [1..3]

How to solve a simple linear equation using postfix

I have a program that takes in a simple linear equation and transforms it into its equivalent in postfix.
For example:
3x+7=4(2x-1)
would be transformed into
3 x * 7 + = 4 2 x * 1 - *
How can i get the value of x in this example using its postfix form. Any help will be greatly appreciated thank you
EDIT - I need help with the logic not the code (I'm not asking for people to do the code for me)
If your linear equation is always in the form of a right hand side (RHS) in terms of x and a left hand side (LHS) in terms of x, then the following would work.
Subtract the LHS from both the LHS and the RHS. Then you have 0 on the LHS and an expression in terms of x on the RHS.
Begin to simplify the postfix expression. Every time you encounter an addition or subtraction operation with a numerical operand, add or subtract that value from the LHS as appropriate, and replace the operand in the calculation with 0.
At the end you should be left with an equation in the form of b = a * x. The solution (if one exists and is unique) is then b / a.
Same as algebra, first let's get it into a simplified form:
3 x * 7 + = 4 2 x * 1 - *
We see a = 2 x *, then b = a 1 -, leaving 4 b *. Multiply each term in b:
3 x * 7 + = 2 4 * x * 1 4 * -
3 x * 7 + = 8 x * 4 -
Do the same on the left:
3 7 / x * 1 + = 8 x * 4 -
Now subtract 1 from each side by removing a top-level 1 + or otherwise altering some top-level addition:
3 7 / x * = 8 x * 5 -
And subtract 8 x *:
3 7 / x * 8 x * - = 0 5 -
Move things around and multiply by -1:
8 x * 3 7 / x * - = 5
Note: multiplying by -1 is easy. Algebraic notation:
(a - b) * -1 = (0 + (a - b)) * -1
(a - b) * -1 = -1*0 + (-1*a - -1*b)
(a - b) * -1 = 0 + (-a - -b)
(a - b) * -1 = (-a + b)
(a - b) * -1 = b - a
I tried using this once to fix a mistake way down the line in linear algebra and lost several points on the test because the instructor said I can't just claim -(a-b) = (b-a) so I had to prove 0-x = -x I guess.
In reverse polish, a b - 0 - = b a - 0 +. Because x is common, reorder the multiplication:
8 3 7 / - x * = 5
53 7 / x * = 5
Divide both sides by 53 / 7:
x = 5 53 7 / *
x = 5 53 * 7 /
x = 265 7 /
x = 37 6 7 / +
Solve for x.

Trying to create an efficient algorithm for a function in Haskell

I'm looking for an efficient polynomial-time solution to the following problem:
Implement a recursive function node x y for calculating the (x,y)-th number in a number triangle defined as
g(x,y) = 0 if |x| > y
= 1 if (x,y) = (0,0)
= sum of all incoming paths otherwise
The sum of all incoming paths to a node is defined as the sum of the values of all possible paths from the root node (x, y) = (0, 0) to the node under consideration, where at each node (x,y) a path can either continue diagonally down and left (x−1,y+1), straight down (x,y+1), or diagonally down and right (x+1,y+1). The value of a path to a node is defined as the sum of all the nodes along that path up to, but not including, the node under consideration.
The first few entries in the number triangle are given in the table:
\ x -3 -2 -1 0 1 2 3
\
y \ _________________________
|
0 | 0 0 0 1 0 0 0
|
1 | 0 0 1 1 1 0 0
|
2 | 0 2 4 6 4 2 0
|
3 | 4 16 40 48 40 16 4
I am trying to work out a naive solution first, here is what I have:
node x y | y < 0 = error "number cannot be negative"
| (abs x) > y = 0
| (x == 0) && (y == 0) = 1
| otherwise = node (x+1) (y-1) + node x (y-1) + node (x-1) (y-1)
Whenever I run this I get:
"* Exception: stack overflow"?
I believe your problem is a bit more complicated than your example code suggests. First, let's be clear about some definitions here:
Let pathCount x y be the number of paths that end at (x, y). We have
pathCount :: Int -> Int -> Integer
pathCount x y
| y == 0 = if x == 0 then 1 else 0
| otherwise = sum [ pathCount (x + d) (y - 1) | d <- [-1..1]]
Now let's pathSum x y be the sum of all paths that end in (x, y). We have:
pathSum :: Int -> Int -> Integer
pathSum x y
| y == 0 = if x == 0 then 1 else 0
| otherwise = sum [ pathSum (x + d) (y - 1) + node x y * pathCount (x + d) (y - 1)
| d <- [-1..1] ]
With this helper, we can finally define node x y properly:
node :: Int -> Int -> Integer
node x y
| y == 0 = if x == 0 then 1 else 0
| otherwise = sum [ pathSum (x + d) (y - 1) | d <- [-1..1]]
This algorithm as such is exponential time in its current form. We can however add memoization to make the number of additions quadratic. The memoize package on Hackage makes this easy as pie. Full example:
import Control.Monad
import Data.List (intercalate)
import Data.Function.Memoize (memoize2)
node' :: Int -> Int -> Integer
node' x y
| y == 0 = if x == 0 then 1 else 0
| otherwise = sum [ pathSum (x + d) (y - 1) | d <- [-1..1]]
node = memoize2 node'
pathCount' :: Int -> Int -> Integer
pathCount' x y
| y == 0 = if x == 0 then 1 else 0
| otherwise = sum [ pathCount (x + d) (y - 1) | d <- [-1..1]]
pathCount = memoize2 pathCount'
pathSum' :: Int -> Int -> Integer
pathSum' x y
| y == 0 = if x == 0 then 1 else 0
| otherwise = sum [ pathSum (x + d) (y - 1) + node x y * pathCount (x + d) (y - 1)
| d <- [-1..1] ]
pathSum = memoize2 pathSum'
main =
forM_ [0..n] $ \y ->
putStrLn $ intercalate " " $ map (show . flip node y) [-n..n]
where n = 5
Output:
0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 1 1 1 0 0 0 0
0 0 0 2 4 6 4 2 0 0 0
0 0 4 16 40 48 40 16 4 0 0
0 8 72 352 728 944 728 352 72 8 0
16 376 4248 16608 35128 43632 35128 16608 4248 376 16
As you can see the algorithm the size of the numbers will get out of hands rather quickly. So the runtime is not O(n^2), while the number of arithmetic operations is.
You're thinking in terms of outgoing paths, when you should be thinking in terms of incoming paths. Your recursive step is currently looking for nodes from below, instead of above.
First of all, sorry if this is long. I wanted to explain the step by step thought process.
To start off with, you need one crucial fact: You can represent the "answer" at each "index" by a list of paths. For all the zeros, this is [[]], for your base case it is [[1]], and for example, for 0,2 it is [[6,1,1],[6,1,1],[6,1,1]]. This may seem like some redundancy, but it simplifies things down the road. Then, extracting the answer is head . head if the list is non empty, or const 0 if it is.
This is very useful because you can store the answer as a list of rows (the first row would be '[[1]], [], [] ...) and the results of any given row depend only on the previous row.
Secondly, this problem is symmetrical. This is pretty obvious.
The first thing we will do will mirror the definition of fib very closely:
type Path = [[Integer]]
triangle' :: [[Path]]
triangle' = ([[1]] : repeat []) : map f triangle'
We know this must be close to correct, since the 2nd row will depend on the first row only, the third on the 2nd only, etc. So the result will be
([[1]] : repeat []) : f ([[1]] : repeat []) : f ....
Now we just need to know what f is. Firstly, its type: [Path] -> [Path]. Quite simply, given the previous row, return the next row.
Now you may see another problem arising. Each invocation of f needs to know how many columns in the current row. We could actually count the length of non-null elements in the previous row, but it is simpler to pass the parameter directly, so we change map f triangle' to zipWith f [1..] triangle', giving f the type Int -> [Path] -> [Path].
f needs to handle one special case and one general case. The special case is x=0, in this case we simply treat the x+1,y-1 and x-1,y-1 recursions the same, and otherwise is identical to gn. Lets make two functions, g0 and gn which handle these two cases.
The actually computation of gn is easy. We know for some x we need the elements x-1, x, x+1 of the previous row. So if we drop x-1 elements before giving the previous row to the xth invocation of gn, gn can just take the first 3 elements and it will have what it needs. We write this as follows:
f :: Int -> [Path] -> [Path]
f n ps = g0 ps : map (gn . flip drop ps) [0..n-1] ++ repeat []
The repeat [] at the end should be obvious: for indices outside the triangle, the result is 0.
Now writing g0 and gs is really quite simple:
g0 :: [Path] -> Path
g0 (a:b:_) = map (s:) q
where
s = sum . concat $ q
q = b ++ a ++ b
gn :: [Path] -> Path
gn (a:b:c:_) = map (s:) q
where
s = sum . concat $ q
q = a ++ b ++ c
On my machine this version is about 3-4 times faster than the fastest version I could write with normal recursion and memoization.
The rest is just printing or pulling out the number you want.
triangle :: Int -> Int -> Integer
triangle x y = case (triangle' !! y) !! (abs x) of
[] -> 0
xs -> head $ head xs
triList :: Int -> Int -> Path
triList x y = (triangle' !! y) !! (abs x)
printTri :: Int -> Int -> IO ()
printTri width height =
putStrLn $ unlines $ map unwords
[[ p $ triangle x y | x <- [-x0..x0]] | y <- [0..height]]
where maxLen = length $ show $ triangle 0 height
x0 = width `div` 2
p = printf $ "%" ++ show maxLen ++ "d "

Code Golf: Quickly Build List of Keywords from Text, Including # of Instances

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I've already worked out this solution for myself with PHP, but I'm curious how it could be done differently - better even. The two languages I'm primarily interested in are PHP and Javascript, but I'd be interested in seeing how quickly this could be done in any other major language today as well (mostly C#, Java, etc).
Return only words with an occurrence greater than X
Return only words with a length greater than Y
Ignore common terms like "and, is, the, etc"
Feel free to strip punctuation prior to processing (ie. "John's" becomes "John")
Return results in a collection/array
Extra Credit
Keep Quoted Statements together, (ie. "They were 'too good to be true' apparently")Where 'too good to be true' would be the actual statement
Extra-Extra Credit
Can your script determine words that should be kept together based upon their frequency of being found together? This being done without knowing the words beforehand. Example:
*"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has lead to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*
Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?
Source text: http://sampsonresume.com/labs/c.txt
Answer Format
It would be great to see the results of your code, output, in addition to how long the operation lasted.
GNU scripting
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr
Results:
7 be
6 to
[...]
1 2.
1 -
With occurence greater than X:
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'
Return only words with a length greater than Y (put Y+1 dots in second grep):
sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c
Ignore common terms like "and, is, the, etc" (assuming that the common terms are in file 'ignored')
sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c
Feel free to strip punctuation prior to processing (ie. "John's" becomes "John"):
sed -e 's/[,.:"\']//g;s/ /\n/g' | grep -v '^ *$' | sort | uniq -c
Return results in a collection/array: it is already like an array for shell: first column is count, second is word.
Perl in only 43 characters.
perl -MYAML -anE'$_{$_}++for#F;say Dump\%_'
Here is an example of it's use:
echo a a a b b c d e aa | perl -MYAML -anE'$_{$_}++for#F;say Dump \%_'
---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1
If you need to list only the lowercase versions, it requires two more characters.
perl -MYAML -anE'$_{lc$_}++for#F;say Dump\%_'
For it to work on the specified text requires 58 characters.
curl http://sampsonresume.com/labs/c.txt |
perl -MYAML -F'\W+' -anE'$_{lc$_}++for#F;END{say Dump\%_}'
real 0m0.679s
user 0m0.304s
sys 0m0.084s
Here is the last example expanded a bit.
#! perl
use 5.010;
use YAML;
while( my $line = <> ){
for my $elem ( split '\W+', $line ){
$_{ lc $elem }++
}
END{
say Dump \%_;
}
}
F#: 304 chars
let f =
let bad = Set.of_seq ["and";"is";"the";"of";"are";"by";"it"]
fun length occurrence msg ->
System.Text.RegularExpressions.Regex.Split(msg, #"[^\w-']+")
|> Seq.countBy (fun a -> a)
|> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)
Ruby
When "minified", this implementation becomes 165 characters long. It uses array#inject to give a starting value (a Hash object with a default of 0) and then loop through the elements, which are then rolled into the hash; the result is then selected from the minimum frequency.
Note that I didn't count the size of the words to skip, that being an external constant. When the constant is counted too, the solution is 244 characters long.
Apostrophes and dashes aren't stripped, but included; their use modifies the word and therefore cannot be stripped simply without removal of all information beyond the symbol.
Implementation
CommonWords = %w(the a an but and is not or as of to in for by be may has can its it's)
def get_keywords(text, minFreq=0, minLen=2)
text.scan(/(?:\b)[a-z'-]{#{minLen},}(?=\b)/i).
inject(Hash.new(0)) do |result,w|
w.downcase!
result[w] += 1 unless CommonWords.include?(w)
result
end.select { |k,n| n >= minFreq }
end
Test Rig
require 'net/http'
keywords = get_keywords(Net::HTTP.get('www.sampsonresume.com','/labs/c.txt'), 3)
keywords.sort.each { |name,count| puts "#{name} x #{count} times" }
Test Results
code x 4 times
declarations x 4 times
each x 3 times
execution x 3 times
expression x 4 times
function x 5 times
keywords x 3 times
language x 3 times
languages x 3 times
new x 3 times
operators x 4 times
programming x 3 times
statement x 7 times
statements x 4 times
such x 3 times
types x 3 times
variables x 3 times
which x 4 times
C# 3.0 (with LINQ)
Here's my solution. It makes use of some pretty nice features of LINQ/extension methods to keep the code short.
public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
{
var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
"for", "by", "an", "be", "may", "has", "can", "its"};
var words = Regex.Replace(text.ToLower(), #"[,.?\/;:\(\)]", string.Empty).Split(' ');
var occurrences = words.Distinct().Except(commonWords).Select(w =>
new { Word = w, Count = words.Count(s => s == w) });
return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
.ToDictionary(wo => wo.Word, wo => wo.Count);
}
This is however far from the most efficient method, being O(n^2) with the number of words, rather than O(n), which is optimal in this case I believe. I'll see if I can creater a slightly longer method that is more efficient.
Here are the results of the function run on the sample text (min occurences: 3, min length: 2).
3 x such
4 x code
4 x which
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
7 x statement
3 x language
3 x expression
3 x execution
3 x programming
4 x operators
3 x variables
And my test program:
static void Main(string[] args)
{
string sampleText;
using (var client = new WebClient())
sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
var keywords = GetKeywords(sampleText, 3, 2);
foreach (var entry in keywords)
Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
Console.ReadKey(true);
}
#! perl
use strict;
use warnings;
while (<>) {
for my $word (split) {
$words{$word}++;
}
}
for my $word (keys %words) {
print "$word occurred $words{$word} times.";
}
That's the simple form. If you want sorting, filtering, etc.:
while (<>) {
for my $word (split) {
$words{$word}++;
}
}
for my $word (keys %words) {
if ((length($word) >= $MINLEN) && ($words{$word) >= $MIN_OCCURRENCE) {
print "$word occurred $words{$word} times.";
}
}
You can also sort the output pretty easily:
...
for my $word (keys %words) {
if ((length($word) >= $MINLEN) && ($words{$word) >= $MIN_OCCURRENCE) {
push #output, "$word occurred $words{$word} times.";
}
}
$re = qr/occurred (\d+) /;
print sort {
$a = $a =~ $re;
$b = $b =~ $re;
$a <=> $b
} #output;
A true Perl hacker will easily get these on one or two lines each, but I went for readability.
Edit: this is how I would rewrite this last example
...
for my $word (
sort { $words{$a} <=> $words{$b} } keys %words
){
next unless length($word) >= $MINLEN;
last unless $words{$word) >= $MIN_OCCURRENCE;
print "$word occurred $words{$word} times.";
}
Or if I needed it to run faster I might even write it like this:
for my $word_data (
sort {
$a->[1] <=> $b->[1] # numerical sort on count
} grep {
# remove values that are out of bounds
length($_->[0]) >= $MINLEN && # word length
$_->[1] >= $MIN_OCCURRENCE # count
} map {
# [ word, count ]
[ $_, $words{$_} ]
} keys %words
){
my( $word, $count ) = #$word_data;
print "$word occurred $count times.";
}
It uses map for efficiency,
grep to remove extra elements,
and sort to do the sorting, of course. ( it does so it in that order )
This is a slight variant of the Schwartzian transform.
Another Python solution, at 247 chars. The actual code is a single line of highly dense Python line of 134 chars that computes the whole thing in a single expression.
x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
from itertools import groupby as gb
d=dict((w,l)for w,l in((w,len(list(g)))for w,g in
gb(sorted(open("c.txt").read().lower().split())))
if l>x and len(w)>y and w not in W)
A much longer version with plenty of comments for you reading pleasure:
# High and low count boundaries.
x = 3
y = 2
# Common words string split into a list by spaces.
Words = "and is the as of to or in for by an be may has can its".split()
# A special function that groups similar strings in a list into a
# (string, grouper) pairs. Grouper is a generator of occurences (see below).
from itertools import groupby
# Reads the entire file, converts it to lower case and splits on whitespace
# to create a list of words
sortedWords = sorted(open("c.txt").read().lower().split())
# Using the groupby function, groups similar words together.
# Since grouper is a generator of occurences we need to use len(list(grouper))
# to get the word count by first converting the generator to a list and then
# getting the length of the list.
wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))
# Filters the words by number of occurences and common words using yet another
# list comprehension.
filteredWordCounts = ((word, count) for word, count in wordCounts if word not in Words and count > x and len(word) > y)
# Creates a dictionary from the list of tuples.
result = dict(filteredWordCounts)
print result
The main trick here is using the itertools.groupby function to count the occurrences on a sorted list. Don't know if it really saves characters, but it does allow all the processing to happen in a single expression.
Results:
{'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}
C# code:
IEnumerable<KeyValuePair<String, Int32>> ProcessText(String text, int X, int Y)
{
// common words, that will be ignored
var exclude = new string[] { "and", "is", "the", "as", "of", "to", "or", "in", "for", "by", "an", "be", "may", "has", "can", "its" }.ToDictionary(word => word);
// regular expression to find quoted text
var regex = new Regex("\"[^\"]\"", RegexOptions.Compiled);
return
// remove quoted text (it will be processed later)
regex.Replace(text, "")
// remove case dependency
.ToLower()
// split text by all these chars
.Split(".,'\\/[]{}()`~##$%^&*-=+?!;:<>| \n\r".ToCharArray())
// add quoted text
.Concat(regex.Matches(text).Cast<Match>().Select(match => match.Value))
// group words by the word and count them
.GroupBy(word => word, (word, words) => new KeyValuePair<String, Int32>(word, words.Count()))
// apply filter(min word count and word length) and remove common words
.Where(pair => pair.Value >= X && pair.Key.Length >= Y && !exclude.ContainsKey(pair.Key));
}
Output for ProcessText(text, 3, 2) call:
3 x languages
3 x such
4 x code
4 x which
3 x based
3 x each
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
3 x variables
7 x statement
4 x expression
3 x execution
3 x programming
3 x operators
In C#:
Use LINQ, specifically groupby, then filter by group count, and return a flattened (selectmany) list.
Use LINQ, filter by length.
Use LINQ, filter with 'badwords'.Contains.
REBOL
Verbose, perhaps, so definitely not a winner, but gets the job done.
min-length: 0
min-count: 0
common-words: [ "a" "an" "as" "and" "are" "by" "for" "from" "in" "is" "it" "its" "the" "of" "or" "to" "until" ]
add-word: func [
word [string!]
/local
count
letter
non-letter
temp
rules
match
][
; Strip out punctuation
temp: copy {}
letter: charset [ #"a" - #"z" #"A" - #"Z" #" " ]
non-letter: complement letter
rules: [
some [
copy match letter (append temp match)
|
non-letter
]
]
parse/all word rules
word: temp
; If we end up with nothing, bail
if 0 == length? word [
exit
]
; Check length
if min-length > length? word [
exit
]
; Ignore common words
ignore:
if find common-words word [
exit
]
; OK, its good. Add it.
either found? count: select words word [
words/(word): count + 1
][
repend words [word 1]
]
]
rules: [
some [
{"}
copy word to {"} (add-word word)
{"}
|
copy word to { } (add-word word)
{ }
]
end
]
words: copy []
parse/all read %c.txt rules
result: copy []
foreach word words [
if string? word [
count: words/:word
if count >= min-count [
append result word
]
]
]
sort result
foreach word result [ print word ]
The output is:
act
actions
all
allows
also
any
appear
arbitrary
arguments
assign
assigned
based
be
because
been
before
below
between
braces
branches
break
builtin
but
C
C like any other language has its blemishes Some of the operators have the wrong precedence some parts of the syntax could be better
call
called
calls
can
care
case
char
code
columnbased
comma
Comments
common
compiler
conditional
consisting
contain
contains
continue
control
controlflow
criticized
Cs
curly brackets
declarations
define
definitions
degree
delimiters
designated
directly
dowhile
each
effect
effects
either
enclosed
enclosing
end
entry
enum
evaluated
evaluation
evaluations
even
example
executed
execution
exert
expression
expressionExpressions
expressions
familiarity
file
followed
following
format
FORTRAN
freeform
function
functions
goto
has
high
However
identified
ifelse
imperative
include
including
initialization
innermost
int
integer
interleaved
Introduction
iterative
Kernighan
keywords
label
language
languages
languagesAlthough
leave
limit
lineEach
loop
looping
many
may
mimicked
modify
more
most
name
needed
new
next
nonstructured
normal
object
obtain
occur
often
omitted
on
operands
operator
operators
optimization
order
other
perhaps
permits
points
programmers
programming
provides
rather
reinitialization
reliable
requires
reserve
reserved
restrictions
results
return
Ritchie
say
scope
Sections
see
selects
semicolon
separate
sequence
sequence point
sequential
several
side
single
skip
sometimes
source
specify
statement
statements
storage
struct
Structured
structuresAs
such
supported
switch
syntax
testing
textlinebased
than
There
This
turn
type
types
union
Unlike
unspecified
use
used
uses
using
usually
value
values
variable
variables
variety
which
while
whitespace
widespread
will
within
writing
Python (258 chars as is, including 66 chars for first line and 30 chars for punctuation removal) :
W="and is the as of to or in for by an be may has can its".split()
x=3;y=2;d={}
for l in open('c.txt') :
for w in l.lower().translate(None,',.;\'"!()[]{}').split() :
if w not in W: d[w]=d.get(w,0)+1
for w,n in d.items() :
if n>y and len(w)>x : print n,w
output :
4 code
3 keywords
3 languages
3 execution
3 each
3 language
4 expression
4 statements
3 variables
7 statement
5 function
4 operators
4 declarations
3 programming
4 which
3 such
3 types
Here is my variant, in PHP:
$str = implode(file('c.txt'));
$tok = strtok($str, " .,;()\r\n\t");
$splitters = '\s.,\(\);?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );
foreach($array as $key) {
$res[$key] = $res[$key]+1;
}
$splitters = '\s.,\(\)\{\};?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );
foreach($array as $key) {
$res[$key] = $res[$key]+1;
}
unset($res['the']);
unset($res['and']);
unset($res['to']);
unset($res['of']);
unset($res['by']);
unset($res['a']);
unset($res['as']);
unset($res['is']);
unset($res['in']);
unset($res['']);
arsort($res);
//var_dump($res); // concordance
foreach ($res AS $word => $rarity)
echo $word . ' <b>x</b> ' . $rarity . '<br/>';
foreach ($array as $word) { // words longer than n (=5)
// if(strlen($word) > 5)echo $word.'<br/>';
}
And output:
statement x 7
be x 7
C x 5
may x 5
for x 5
or x 5
The x 5
as x 5
expression x 4
statements x 4
code x 4
function x 4
which x 4
an x 4
declarations x 3
new x 3
execution x 3
types x 3
such x 3
variables x 3
can x 3
languages x 3
operators x 3
end x 2
programming x 2
evaluated x 2
functions x 2
definitions x 2
keywords x 2
followed x 2
contain x 2
several x 2
side x 2
most x 2
has x 2
its x 2
called x 2
specify x 2
reinitialization x 2
use x 2
either x 2
each x 2
all x 2
built-in x 2
source x 2
are x 2
storage x 2
than x 2
effects x 1
including x 1
arguments x 1
order x 1
even x 1
unspecified x 1
evaluations x 1
operands x 1
interleaved x 1
However x 1
value x 1
branches x 1
goto x 1
directly x 1
designated x 1
label x 1
non-structured x 1
also x 1
enclosing x 1
innermost x 1
loop x 1
skip x 1
There x 1
within x 1
switch x 1
Expressions x 1
integer x 1
variety x 1
see x 1
below x 1
will x 1
on x 1
selects x 1
case x 1
executed x 1
based x 1
calls x 1
from x 1
because x 1
many x 1
widespread x 1
familiarity x 1
C's x 1
mimicked x 1
Although x 1
reliable x 1
obtain x 1
results x 1
needed x 1
other x 1
syntax x 1
often x 1
Introduction x 1
say x 1
Programming x 1
Language x 1
C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better. x 1
Ritchie x 1
Kernighan x 1
been x 1
criticized x 1
For x 1
example x 1
care x 1
more x 1
leave x 1
return x 1
call x 1
&& x 1
|| x 1
entry x 1
include x 1
next x 1
before x 1
sequence point x 1
sequence x 1
points x 1
comma x 1
operator x 1
but x 1
compiler x 1
requires x 1
programmers x 1
exert x 1
optimization x 1
object x 1
This x 1
permits x 1
high x 1
degree x 1
occur x 1
Structured x 1
using x 1
struct x 1
union x 1
enum x 1
define x 1
Declarations x 1
file x 1
contains x 1
Function x 1
turn x 1
assign x 1
perhaps x 1
Keywords x 1
char x 1
int x 1
Sections x 1
name x 1
variable x 1
reserve x 1
usually x 1
writing x 1
type x 1
Each x 1
line x 1
format x 1
rather x 1
column-based x 1
text-line-based x 1
whitespace x 1
arbitrary x 1
FORTRAN x 1
77 x 1
free-form x 1
allows x 1
restrictions x 1
Comments x 1
C99 x 1
following x 1
// x 1
until x 1
*/ x 1
/* x 1
appear x 1
between x 1
delimiters x 1
enclosed x 1
braces x 1
supported x 1
if x 1
-else x 1
conditional x 1
Unlike x 1
reserved x 1
sequential x 1
provides x 1
control-flow x 1
identified x 1
do-while x 1
while x 1
any x 1
omitted x 1
break x 1
continue x 1
expressions x 1
testing x 1
iterative x 1
looping x 1
separate x 1
initialization x 1
normal x 1
modify x 1
control x 1
structures x 1
As x 1
imperative x 1
single x 1
act x 1
sometimes x 1
curly brackets x 1
limit x 1
scope x 1
language x 1
uses x 1
evaluation x 1
assigned x 1
values x 1
To x 1
effect x 1
semicolon x 1
actions x 1
common x 1
consisting x 1
used x 1
var_dump statement simply displays concordance. This variant preserves double-quoted expressions.
For supplied file this code finishes in 0.047 seconds. Though larger file will consume lots of memory (because of file function).
This is not going to win any golfing awards but it does keep quoted phrases together and takes into account stop words (and leverages CPAN modules Lingua::StopWords and Text::ParseWords).
In addition, I use to_S from Lingua::EN::Inflect::Number to count only the singular forms of words.
You might also want to look at Lingua::CollinsParser.
#!/usr/bin/perl
use strict; use warnings;
use Lingua::EN::Inflect::Number qw( to_S );
use Lingua::StopWords qw( getStopWords );
use Text::ParseWords;
my $stop = getStopWords('en');
my %words;
while ( my $line = <> ) {
chomp $line;
next unless $line =~ /\S/;
next unless my #words = parse_line(' ', 1, $line);
++ $words{to_S $_} for
grep { length and not $stop->{$_} }
map { s!^[[:punct:]]+!!; s![[:punct:]]+\z!!; lc }
#words;
}
print "=== only words appearing 4 or more times ===\n";
print "$_ : $words{$_}\n" for sort {
$words{$b} <=> $words{$a}
} grep { $words{$_} > 3 } keys %words;
print "=== only words that are 12 characters or longer ===\n";
print "$_ : $words{$_}\n" for sort {
$words{$b} <=> $words{$a}
} grep { 11 < length } keys %words;
Output:
=== only words appearing 4 or more times ===
statement : 11
function : 7
expression : 6
may : 5
code : 4
variable : 4
operator : 4
declaration : 4
c : 4
type : 4
=== only words that are 12 characters or longer ===
reinitialization : 2
control-flow : 1
sequence point : 1
optimization : 1
curly brackets : 1
text-line-based : 1
non-structured : 1
column-based : 1
initialization : 1

Resources