i have a corpus of text files, contains just text, I want to extract the ngrams from the texts and save each one with his original file name in matrixes of 3 columns..
library(tokenizer)
myTokenizer <- function(x, n, n_min) {
corp<-"this is a full text "
tok <- unlist(tokenize_ngrams(as.character(x), n = n, n_min = n_min))
M <- matrix(nrow=length(tok), ncol=3,
dimnames=list(NULL, c( "gram" , "num.words", "words")))
}
corp <- tm_map(corp,content_transformer(function (x) myTokenizer(x, n=3, n_min=1)))
writecorpus(corp)
Since I don't have your corpus I created one of my own using the crude dataset from tm. No need to use tm_map as that keeps the data in a corpus format. The tokenizer package can handle this.
What I do is store all your desired matrices in a list object via lapply and then use sapply to store the data in the crude directory as separate files.
Do realize that the matrices as specified in your function will be character matrices. This means that columns 1 and 2 will be characters, not numbers.
library(tm)
data("crude")
crude <- as.VCorpus(crude)
myTokenizer <- function(x, n, n_min) {
tok <- unlist(tokenizers::tokenize_ngrams(as.character(x), n = n, n_min = n_min))
M <- matrix(nrow=length(tok), ncol=3,
dimnames=list(NULL, c( "gram" , "num.words", "words")))
M[, 3] <- tok
M[, 2] <- lengths(strsplit(M[, 3], "\\W+")) # counts the words
M[, 1] <- 1:length(tok)
return(M)
}
my_matrices <- lapply(crude, myTokenizer, n = 3, n_min = 1)
# make sure directory crude exists as a subfolder in working directory
sapply(names(my_matrices),
function (x) write.table(my_matrices[[x]], file=paste("crude/", x, ".txt", sep=""), row.names = FALSE))
outcome of the first file:
"gram" "num.words" "words"
"1" "1" "diamond"
"2" "2" "diamond shamrock"
"3" "3" "diamond shamrock corp"
"4" "1" "shamrock"
"5" "2" "shamrock corp"
"6" "3" "shamrock corp said"
I would recommend to create a document term matrix (DTM). You will probably need this in your downstream tasks anyway. From that you could also extract the information you want, although, it is probably not reasonable to assume that a term (incl. ngrams) only has a single document where its coming from (at least this is what I understood from your question, please correct me if I am wrong). Therefore, I guess that in practice one term will have several documents associated with it - this kind of information is usually stored in a DTM.
An example with text2vec below. If you could elaborate further how you want to use your terms, etc. I could adapt the code according to your needs.
library(text2vec)
# I have set up two text do not overlap in any term just as an example
# in practice, this probably never happens
docs = c(d1 = c("here a text"), d2 = c("and another one"))
it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
v = create_vocabulary(it, ngram = c(1,3))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
as.matrix(dtm)
# a a_text and and_another and_another_one another another_one here here_a here_a_text one text
# d1 1 1 0 0 0 0 0 1 1 1 0 1
# d2 0 0 1 1 1 1 1 0 0 0 1 0
library(stringi)
docs = c(d1 = c("here a text"), d2 = c("and another one"))
it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
v = create_vocabulary(it, ngram = c(1,3))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
for (d in rownames(dtm)) {
v = dtm[d, ]
v = v[v!=0]
v = data.frame(number = 1:length(v)
,term = names(v))
v$n = stri_count_fixed(v$term, "_")+1
write.csv(v, file = paste0("v_", d, ".csv"), row.names = F)
}
read.csv("v_d1.csv")
# number term n
# 1 1 a 1
# 2 2 a_text 2
# 3 3 here 1
# 4 4 here_a 2
# 5 5 here_a_text 3
# 6 6 text 1
read.csv("v_d2.csv")
# number term n
# 1 1 and 1
# 2 2 and_another 2
# 3 3 and_another_one 3
# 4 4 another 1
# 5 5 another_one 2
# 6 6 one 1
I came across this snippet online while researching for ways to test that a number is a perfect square:
if n is a perfect square then it is congruent to 0, 1 or 2 mod 4
I tried translating it to ruby code with a ternary statement (see below) but it doesn't work correctly.
n % 4 == 0 && n % 4 == 1 && n % 4 == 2 ? true : false
How would you approach this?
Your snipped says "... is congruent to 0, 1 or 2 mod 4" but you're using &&. Changing that to || fixes your expression:
n % 4 == 0 || n % 4 == 1 || n % 4 == 2 ? true : false
The ternary statement is redundant because you already have a boolean result. It's equivalent to:
n % 4 == 0 || n % 4 == 1 || n % 4 == 2
Furthermore n % 4 returns either 0, 1, 2 or 3 for any n, so it can also be expressed as:
n % 4 <= 2 # because 0, 1, 2 are less or equal to 2
n % 4 < 3 # because 0, 1, 2 are less than 3
n % 4 != 3 # because 0, 1, 2 are not 3
You can write this:
n % 4 <= 2
You can approach by the following way
n % 4 < = 2
Try this:
(0..2).include?(n % 4)
if counter % 2 == 1 I am trying to decode this line - it's a Rails project and I am trying to figure out what the % does in this if statement.
% is the modulo operator. The result of counter % 2 is the remainder of counter / 2.
n % 2 is often a good way of determining if a number n is even or odd. If n % 2 == 0, the number is even (because no remainder means that the number is evenly divisible by 2); if n % 2 == 1, the number is odd.
In answer to the question "What does the % symbol do or mean in Ruby?" It is:
1) The modulo binary operator (as has been mentioned)
17 % 10 #=> 7
2) The alternative string delimiter token
%Q{hello world} #=> "hello world"
%Q(hello world) #=> "hello world"
%Q[hello world] #=> "hello world"
%Q!hello world! #=> "hello world"
# i.e. choose your own bracket pair
%q(hello world) #=> 'hello world'
%x(pwd) #=> `pwd`
%r(.*) #=> /.*/
3) The string format operator (shorthand for Kernel::sprintf)
"05d" % 123 #=> "00123"
That's the modulo operator. It gives the remainder when counter is divided by 2.
For example:
3 % 2 == 1
2 % 2 == 0
Regardless of how it works, the modulo operator is probably not the best code for the purpose (even though we are not given much context). As Jörg mentioned in a comment, the expression if counter.odd? is probably the intent, and is more readable.
If this is view code and used to determine (for example) alternating row colors, then you may be able to do without the counter altogether by using the built-in Rails helper cycle(). For example, you can use cycle('odd','even') as a class name for a table row, eliminating the counter and surrounding if/then logic.
Another thought: if this is within an each block, you may be able to use each_with_index and eliminate the extraneous counter variable.
My refactoring $0.02.
Also keep in mind that, Ruby's definition of the modulo (%) operator differs from that of C and Java. In Ruby, -7%3 is 2. In C and Java, the result is -1 instead. In Ruby, the sign of the result (for % operator) is always the same as the sign of the second operand.
Its the modulo operator.
http://en.wikipedia.org/wiki/Modulo_operation
It is the modulo operator, which is a fancy way of saying it's the remainder operator.
So if you divided a number by two, and the integer remainder of that number is one, then you know the number was odd. Your example checks for odd numbers.
Often this is done to highlight odd number rows with a different background color, making it easier to read large lists of data.
This is a very basic question. % is the modulo opereator if counter % 2 == 1 results to true for every odd number and to false for every even number.
If you're learning ruby you should learn how to use irb, there you can try things out and perhaps answer the question yourself.
try to enter
100.times{|i| puts "#{i} % 2 == 1 #=> #{i % 2 == 1}"}
into your irb irb console and see the output, than it should be clear what % does.
And you really should take a look at the rails api documentation (1.9, 1.8.7, 1.8.7), there you would have found the answer two your question % (Fixnum) with a further link to a detailed description of divmod (Numeric):
Returns an array containing the quotient and modulus obtained by dividing num by aNumeric. > If q, r = x.divmod(y), then
q = floor(float(x)/float(y))
x = q*y + r
The quotient is rounded toward -infinity, as shown in the following table:
a | b | a.divmod(b) | a/b | a.modulo(b) | a.remainder(b)
------+-----+---------------+---------+-------------+---------------
13 | 4 | 3, 1 | 3 | 1 | 1
------+-----+---------------+---------+-------------+---------------
13 | -4 | -4, -3 | -3 | -3 | 1
------+-----+---------------+---------+-------------+---------------
-13 | 4 | -4, 3 | -4 | 3 | -1
------+-----+---------------+---------+-------------+---------------
-13 | -4 | 3, -1 | 3 | -1 | -1
------+-----+---------------+---------+-------------+---------------
11.5 | 4 | 2, 3.5 | 2.875 | 3.5 | 3.5
------+-----+---------------+---------+-------------+---------------
11.5 | -4 | -3, -0.5 | -2.875 | -0.5 | 3.5
------+-----+---------------+---------+-------------+---------------
-11.5 | 4 | -3, 0.5 | -2.875 | 0.5 | -3.5
------+-----+---------------+---------+-------------+---------------
-11.5 | -4 | 2 -3.5 | 2.875 | -3.5 | -3.5
Examples
11.divmod(3) #=> [3, 2]
11.divmod(-3) #=> [-4, -1]
11.divmod(3.5) #=> [3, 0.5]
(-11).divmod(3.5) #=> [-4, 3.0]
(11.5).divmod(3.5) #=> [3, 1.0]
To give a few ways to say it:
Modulo operator
Remainder operator
Modular residue
Strictly speaking, if a % b = c, c is the unique constant such that
a == c (mod b) and 0 <= c < b
Where x == y (mod m) iff x - y = km for some constant k.
This is equivalent to the remainder. By some well known theorem, we have that a = bk + c for some constant k, where c is the remainder, which gives us a - c = bk, which obviously implies a == c (mod b).
(Is there a way to use LaTeX on Stackoverflow?)
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I've already worked out this solution for myself with PHP, but I'm curious how it could be done differently - better even. The two languages I'm primarily interested in are PHP and Javascript, but I'd be interested in seeing how quickly this could be done in any other major language today as well (mostly C#, Java, etc).
Return only words with an occurrence greater than X
Return only words with a length greater than Y
Ignore common terms like "and, is, the, etc"
Feel free to strip punctuation prior to processing (ie. "John's" becomes "John")
Return results in a collection/array
Extra Credit
Keep Quoted Statements together, (ie. "They were 'too good to be true' apparently")Where 'too good to be true' would be the actual statement
Extra-Extra Credit
Can your script determine words that should be kept together based upon their frequency of being found together? This being done without knowing the words beforehand. Example:
*"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has lead to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*
Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?
Source text: http://sampsonresume.com/labs/c.txt
Answer Format
It would be great to see the results of your code, output, in addition to how long the operation lasted.
GNU scripting
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr
Results:
7 be
6 to
[...]
1 2.
1 -
With occurence greater than X:
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'
Return only words with a length greater than Y (put Y+1 dots in second grep):
sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c
Ignore common terms like "and, is, the, etc" (assuming that the common terms are in file 'ignored')
sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c
Feel free to strip punctuation prior to processing (ie. "John's" becomes "John"):
sed -e 's/[,.:"\']//g;s/ /\n/g' | grep -v '^ *$' | sort | uniq -c
Return results in a collection/array: it is already like an array for shell: first column is count, second is word.
Perl in only 43 characters.
perl -MYAML -anE'$_{$_}++for#F;say Dump\%_'
Here is an example of it's use:
echo a a a b b c d e aa | perl -MYAML -anE'$_{$_}++for#F;say Dump \%_'
---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1
If you need to list only the lowercase versions, it requires two more characters.
perl -MYAML -anE'$_{lc$_}++for#F;say Dump\%_'
For it to work on the specified text requires 58 characters.
curl http://sampsonresume.com/labs/c.txt |
perl -MYAML -F'\W+' -anE'$_{lc$_}++for#F;END{say Dump\%_}'
real 0m0.679s
user 0m0.304s
sys 0m0.084s
Here is the last example expanded a bit.
#! perl
use 5.010;
use YAML;
while( my $line = <> ){
for my $elem ( split '\W+', $line ){
$_{ lc $elem }++
}
END{
say Dump \%_;
}
}
F#: 304 chars
let f =
let bad = Set.of_seq ["and";"is";"the";"of";"are";"by";"it"]
fun length occurrence msg ->
System.Text.RegularExpressions.Regex.Split(msg, #"[^\w-']+")
|> Seq.countBy (fun a -> a)
|> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)
Ruby
When "minified", this implementation becomes 165 characters long. It uses array#inject to give a starting value (a Hash object with a default of 0) and then loop through the elements, which are then rolled into the hash; the result is then selected from the minimum frequency.
Note that I didn't count the size of the words to skip, that being an external constant. When the constant is counted too, the solution is 244 characters long.
Apostrophes and dashes aren't stripped, but included; their use modifies the word and therefore cannot be stripped simply without removal of all information beyond the symbol.
Implementation
CommonWords = %w(the a an but and is not or as of to in for by be may has can its it's)
def get_keywords(text, minFreq=0, minLen=2)
text.scan(/(?:\b)[a-z'-]{#{minLen},}(?=\b)/i).
inject(Hash.new(0)) do |result,w|
w.downcase!
result[w] += 1 unless CommonWords.include?(w)
result
end.select { |k,n| n >= minFreq }
end
Test Rig
require 'net/http'
keywords = get_keywords(Net::HTTP.get('www.sampsonresume.com','/labs/c.txt'), 3)
keywords.sort.each { |name,count| puts "#{name} x #{count} times" }
Test Results
code x 4 times
declarations x 4 times
each x 3 times
execution x 3 times
expression x 4 times
function x 5 times
keywords x 3 times
language x 3 times
languages x 3 times
new x 3 times
operators x 4 times
programming x 3 times
statement x 7 times
statements x 4 times
such x 3 times
types x 3 times
variables x 3 times
which x 4 times
C# 3.0 (with LINQ)
Here's my solution. It makes use of some pretty nice features of LINQ/extension methods to keep the code short.
public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
{
var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
"for", "by", "an", "be", "may", "has", "can", "its"};
var words = Regex.Replace(text.ToLower(), #"[,.?\/;:\(\)]", string.Empty).Split(' ');
var occurrences = words.Distinct().Except(commonWords).Select(w =>
new { Word = w, Count = words.Count(s => s == w) });
return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
.ToDictionary(wo => wo.Word, wo => wo.Count);
}
This is however far from the most efficient method, being O(n^2) with the number of words, rather than O(n), which is optimal in this case I believe. I'll see if I can creater a slightly longer method that is more efficient.
Here are the results of the function run on the sample text (min occurences: 3, min length: 2).
3 x such
4 x code
4 x which
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
7 x statement
3 x language
3 x expression
3 x execution
3 x programming
4 x operators
3 x variables
And my test program:
static void Main(string[] args)
{
string sampleText;
using (var client = new WebClient())
sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
var keywords = GetKeywords(sampleText, 3, 2);
foreach (var entry in keywords)
Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
Console.ReadKey(true);
}
#! perl
use strict;
use warnings;
while (<>) {
for my $word (split) {
$words{$word}++;
}
}
for my $word (keys %words) {
print "$word occurred $words{$word} times.";
}
That's the simple form. If you want sorting, filtering, etc.:
while (<>) {
for my $word (split) {
$words{$word}++;
}
}
for my $word (keys %words) {
if ((length($word) >= $MINLEN) && ($words{$word) >= $MIN_OCCURRENCE) {
print "$word occurred $words{$word} times.";
}
}
You can also sort the output pretty easily:
...
for my $word (keys %words) {
if ((length($word) >= $MINLEN) && ($words{$word) >= $MIN_OCCURRENCE) {
push #output, "$word occurred $words{$word} times.";
}
}
$re = qr/occurred (\d+) /;
print sort {
$a = $a =~ $re;
$b = $b =~ $re;
$a <=> $b
} #output;
A true Perl hacker will easily get these on one or two lines each, but I went for readability.
Edit: this is how I would rewrite this last example
...
for my $word (
sort { $words{$a} <=> $words{$b} } keys %words
){
next unless length($word) >= $MINLEN;
last unless $words{$word) >= $MIN_OCCURRENCE;
print "$word occurred $words{$word} times.";
}
Or if I needed it to run faster I might even write it like this:
for my $word_data (
sort {
$a->[1] <=> $b->[1] # numerical sort on count
} grep {
# remove values that are out of bounds
length($_->[0]) >= $MINLEN && # word length
$_->[1] >= $MIN_OCCURRENCE # count
} map {
# [ word, count ]
[ $_, $words{$_} ]
} keys %words
){
my( $word, $count ) = #$word_data;
print "$word occurred $count times.";
}
It uses map for efficiency,
grep to remove extra elements,
and sort to do the sorting, of course. ( it does so it in that order )
This is a slight variant of the Schwartzian transform.
Another Python solution, at 247 chars. The actual code is a single line of highly dense Python line of 134 chars that computes the whole thing in a single expression.
x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
from itertools import groupby as gb
d=dict((w,l)for w,l in((w,len(list(g)))for w,g in
gb(sorted(open("c.txt").read().lower().split())))
if l>x and len(w)>y and w not in W)
A much longer version with plenty of comments for you reading pleasure:
# High and low count boundaries.
x = 3
y = 2
# Common words string split into a list by spaces.
Words = "and is the as of to or in for by an be may has can its".split()
# A special function that groups similar strings in a list into a
# (string, grouper) pairs. Grouper is a generator of occurences (see below).
from itertools import groupby
# Reads the entire file, converts it to lower case and splits on whitespace
# to create a list of words
sortedWords = sorted(open("c.txt").read().lower().split())
# Using the groupby function, groups similar words together.
# Since grouper is a generator of occurences we need to use len(list(grouper))
# to get the word count by first converting the generator to a list and then
# getting the length of the list.
wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))
# Filters the words by number of occurences and common words using yet another
# list comprehension.
filteredWordCounts = ((word, count) for word, count in wordCounts if word not in Words and count > x and len(word) > y)
# Creates a dictionary from the list of tuples.
result = dict(filteredWordCounts)
print result
The main trick here is using the itertools.groupby function to count the occurrences on a sorted list. Don't know if it really saves characters, but it does allow all the processing to happen in a single expression.
Results:
{'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}
C# code:
IEnumerable<KeyValuePair<String, Int32>> ProcessText(String text, int X, int Y)
{
// common words, that will be ignored
var exclude = new string[] { "and", "is", "the", "as", "of", "to", "or", "in", "for", "by", "an", "be", "may", "has", "can", "its" }.ToDictionary(word => word);
// regular expression to find quoted text
var regex = new Regex("\"[^\"]\"", RegexOptions.Compiled);
return
// remove quoted text (it will be processed later)
regex.Replace(text, "")
// remove case dependency
.ToLower()
// split text by all these chars
.Split(".,'\\/[]{}()`~##$%^&*-=+?!;:<>| \n\r".ToCharArray())
// add quoted text
.Concat(regex.Matches(text).Cast<Match>().Select(match => match.Value))
// group words by the word and count them
.GroupBy(word => word, (word, words) => new KeyValuePair<String, Int32>(word, words.Count()))
// apply filter(min word count and word length) and remove common words
.Where(pair => pair.Value >= X && pair.Key.Length >= Y && !exclude.ContainsKey(pair.Key));
}
Output for ProcessText(text, 3, 2) call:
3 x languages
3 x such
4 x code
4 x which
3 x based
3 x each
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
3 x variables
7 x statement
4 x expression
3 x execution
3 x programming
3 x operators
In C#:
Use LINQ, specifically groupby, then filter by group count, and return a flattened (selectmany) list.
Use LINQ, filter by length.
Use LINQ, filter with 'badwords'.Contains.
REBOL
Verbose, perhaps, so definitely not a winner, but gets the job done.
min-length: 0
min-count: 0
common-words: [ "a" "an" "as" "and" "are" "by" "for" "from" "in" "is" "it" "its" "the" "of" "or" "to" "until" ]
add-word: func [
word [string!]
/local
count
letter
non-letter
temp
rules
match
][
; Strip out punctuation
temp: copy {}
letter: charset [ #"a" - #"z" #"A" - #"Z" #" " ]
non-letter: complement letter
rules: [
some [
copy match letter (append temp match)
|
non-letter
]
]
parse/all word rules
word: temp
; If we end up with nothing, bail
if 0 == length? word [
exit
]
; Check length
if min-length > length? word [
exit
]
; Ignore common words
ignore:
if find common-words word [
exit
]
; OK, its good. Add it.
either found? count: select words word [
words/(word): count + 1
][
repend words [word 1]
]
]
rules: [
some [
{"}
copy word to {"} (add-word word)
{"}
|
copy word to { } (add-word word)
{ }
]
end
]
words: copy []
parse/all read %c.txt rules
result: copy []
foreach word words [
if string? word [
count: words/:word
if count >= min-count [
append result word
]
]
]
sort result
foreach word result [ print word ]
The output is:
act
actions
all
allows
also
any
appear
arbitrary
arguments
assign
assigned
based
be
because
been
before
below
between
braces
branches
break
builtin
but
C
C like any other language has its blemishes Some of the operators have the wrong precedence some parts of the syntax could be better
call
called
calls
can
care
case
char
code
columnbased
comma
Comments
common
compiler
conditional
consisting
contain
contains
continue
control
controlflow
criticized
Cs
curly brackets
declarations
define
definitions
degree
delimiters
designated
directly
dowhile
each
effect
effects
either
enclosed
enclosing
end
entry
enum
evaluated
evaluation
evaluations
even
example
executed
execution
exert
expression
expressionExpressions
expressions
familiarity
file
followed
following
format
FORTRAN
freeform
function
functions
goto
has
high
However
identified
ifelse
imperative
include
including
initialization
innermost
int
integer
interleaved
Introduction
iterative
Kernighan
keywords
label
language
languages
languagesAlthough
leave
limit
lineEach
loop
looping
many
may
mimicked
modify
more
most
name
needed
new
next
nonstructured
normal
object
obtain
occur
often
omitted
on
operands
operator
operators
optimization
order
other
perhaps
permits
points
programmers
programming
provides
rather
reinitialization
reliable
requires
reserve
reserved
restrictions
results
return
Ritchie
say
scope
Sections
see
selects
semicolon
separate
sequence
sequence point
sequential
several
side
single
skip
sometimes
source
specify
statement
statements
storage
struct
Structured
structuresAs
such
supported
switch
syntax
testing
textlinebased
than
There
This
turn
type
types
union
Unlike
unspecified
use
used
uses
using
usually
value
values
variable
variables
variety
which
while
whitespace
widespread
will
within
writing
Python (258 chars as is, including 66 chars for first line and 30 chars for punctuation removal) :
W="and is the as of to or in for by an be may has can its".split()
x=3;y=2;d={}
for l in open('c.txt') :
for w in l.lower().translate(None,',.;\'"!()[]{}').split() :
if w not in W: d[w]=d.get(w,0)+1
for w,n in d.items() :
if n>y and len(w)>x : print n,w
output :
4 code
3 keywords
3 languages
3 execution
3 each
3 language
4 expression
4 statements
3 variables
7 statement
5 function
4 operators
4 declarations
3 programming
4 which
3 such
3 types
Here is my variant, in PHP:
$str = implode(file('c.txt'));
$tok = strtok($str, " .,;()\r\n\t");
$splitters = '\s.,\(\);?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );
foreach($array as $key) {
$res[$key] = $res[$key]+1;
}
$splitters = '\s.,\(\)\{\};?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );
foreach($array as $key) {
$res[$key] = $res[$key]+1;
}
unset($res['the']);
unset($res['and']);
unset($res['to']);
unset($res['of']);
unset($res['by']);
unset($res['a']);
unset($res['as']);
unset($res['is']);
unset($res['in']);
unset($res['']);
arsort($res);
//var_dump($res); // concordance
foreach ($res AS $word => $rarity)
echo $word . ' <b>x</b> ' . $rarity . '<br/>';
foreach ($array as $word) { // words longer than n (=5)
// if(strlen($word) > 5)echo $word.'<br/>';
}
And output:
statement x 7
be x 7
C x 5
may x 5
for x 5
or x 5
The x 5
as x 5
expression x 4
statements x 4
code x 4
function x 4
which x 4
an x 4
declarations x 3
new x 3
execution x 3
types x 3
such x 3
variables x 3
can x 3
languages x 3
operators x 3
end x 2
programming x 2
evaluated x 2
functions x 2
definitions x 2
keywords x 2
followed x 2
contain x 2
several x 2
side x 2
most x 2
has x 2
its x 2
called x 2
specify x 2
reinitialization x 2
use x 2
either x 2
each x 2
all x 2
built-in x 2
source x 2
are x 2
storage x 2
than x 2
effects x 1
including x 1
arguments x 1
order x 1
even x 1
unspecified x 1
evaluations x 1
operands x 1
interleaved x 1
However x 1
value x 1
branches x 1
goto x 1
directly x 1
designated x 1
label x 1
non-structured x 1
also x 1
enclosing x 1
innermost x 1
loop x 1
skip x 1
There x 1
within x 1
switch x 1
Expressions x 1
integer x 1
variety x 1
see x 1
below x 1
will x 1
on x 1
selects x 1
case x 1
executed x 1
based x 1
calls x 1
from x 1
because x 1
many x 1
widespread x 1
familiarity x 1
C's x 1
mimicked x 1
Although x 1
reliable x 1
obtain x 1
results x 1
needed x 1
other x 1
syntax x 1
often x 1
Introduction x 1
say x 1
Programming x 1
Language x 1
C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better. x 1
Ritchie x 1
Kernighan x 1
been x 1
criticized x 1
For x 1
example x 1
care x 1
more x 1
leave x 1
return x 1
call x 1
&& x 1
|| x 1
entry x 1
include x 1
next x 1
before x 1
sequence point x 1
sequence x 1
points x 1
comma x 1
operator x 1
but x 1
compiler x 1
requires x 1
programmers x 1
exert x 1
optimization x 1
object x 1
This x 1
permits x 1
high x 1
degree x 1
occur x 1
Structured x 1
using x 1
struct x 1
union x 1
enum x 1
define x 1
Declarations x 1
file x 1
contains x 1
Function x 1
turn x 1
assign x 1
perhaps x 1
Keywords x 1
char x 1
int x 1
Sections x 1
name x 1
variable x 1
reserve x 1
usually x 1
writing x 1
type x 1
Each x 1
line x 1
format x 1
rather x 1
column-based x 1
text-line-based x 1
whitespace x 1
arbitrary x 1
FORTRAN x 1
77 x 1
free-form x 1
allows x 1
restrictions x 1
Comments x 1
C99 x 1
following x 1
// x 1
until x 1
*/ x 1
/* x 1
appear x 1
between x 1
delimiters x 1
enclosed x 1
braces x 1
supported x 1
if x 1
-else x 1
conditional x 1
Unlike x 1
reserved x 1
sequential x 1
provides x 1
control-flow x 1
identified x 1
do-while x 1
while x 1
any x 1
omitted x 1
break x 1
continue x 1
expressions x 1
testing x 1
iterative x 1
looping x 1
separate x 1
initialization x 1
normal x 1
modify x 1
control x 1
structures x 1
As x 1
imperative x 1
single x 1
act x 1
sometimes x 1
curly brackets x 1
limit x 1
scope x 1
language x 1
uses x 1
evaluation x 1
assigned x 1
values x 1
To x 1
effect x 1
semicolon x 1
actions x 1
common x 1
consisting x 1
used x 1
var_dump statement simply displays concordance. This variant preserves double-quoted expressions.
For supplied file this code finishes in 0.047 seconds. Though larger file will consume lots of memory (because of file function).
This is not going to win any golfing awards but it does keep quoted phrases together and takes into account stop words (and leverages CPAN modules Lingua::StopWords and Text::ParseWords).
In addition, I use to_S from Lingua::EN::Inflect::Number to count only the singular forms of words.
You might also want to look at Lingua::CollinsParser.
#!/usr/bin/perl
use strict; use warnings;
use Lingua::EN::Inflect::Number qw( to_S );
use Lingua::StopWords qw( getStopWords );
use Text::ParseWords;
my $stop = getStopWords('en');
my %words;
while ( my $line = <> ) {
chomp $line;
next unless $line =~ /\S/;
next unless my #words = parse_line(' ', 1, $line);
++ $words{to_S $_} for
grep { length and not $stop->{$_} }
map { s!^[[:punct:]]+!!; s![[:punct:]]+\z!!; lc }
#words;
}
print "=== only words appearing 4 or more times ===\n";
print "$_ : $words{$_}\n" for sort {
$words{$b} <=> $words{$a}
} grep { $words{$_} > 3 } keys %words;
print "=== only words that are 12 characters or longer ===\n";
print "$_ : $words{$_}\n" for sort {
$words{$b} <=> $words{$a}
} grep { 11 < length } keys %words;
Output:
=== only words appearing 4 or more times ===
statement : 11
function : 7
expression : 6
may : 5
code : 4
variable : 4
operator : 4
declaration : 4
c : 4
type : 4
=== only words that are 12 characters or longer ===
reinitialization : 2
control-flow : 1
sequence point : 1
optimization : 1
curly brackets : 1
text-line-based : 1
non-structured : 1
column-based : 1
initialization : 1
Here is a different approach for the Project Euler #1 solution:
+/~.(3*i.>.1000%3),5*i.>.1000%5
How to refactor it?
[:+/#~.#,3 5([*i.#>.#%~)]
usage example:
f =: [:+/#~.#,3 5([*i.#>.#%~)]
f 1000
or
+/~.,3 5([*i.#>.#%~)1000
%~ = 4 : 'y % x'
i.#>.#%~ = 4 : 'i. >. y % x'
[*i.#>.#%~ = 4 : 'x * i. >. y % x'
3 5([*i.#>.#%~)] = 3 : '3 5 * i. >. y % 3 5'
[:+/#~.#,3 5([*i.#>.#%~)] = 3 : '+/ ~. , 3 5 * i. >. y % 3 5'
+/(#~ ( (0= 3| ]) +. (0 = 5 |]) )) 1+i.999
0 = ( 3 | ]) uses (twice) the trick of verb train (fork) with n u v (discussed at the end of http://www.jsoftware.com/help/learning/09.htm)
A different way of writing it:
+/(#~ ( ((0&=) # (3&|)) +. ((0&=) # (5&|)))) 1+i.999
Here is another approach, using a simple, generic verb
multiplesbelow =: 4 : 'I. 0 = x | i.y'
+/ ~. ,3 5 multiplesbelow"0 [ 1000