Code Golf: Quickly Build List of Keywords from Text, Including # of Instances - code-golf

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I've already worked out this solution for myself with PHP, but I'm curious how it could be done differently - better even. The two languages I'm primarily interested in are PHP and Javascript, but I'd be interested in seeing how quickly this could be done in any other major language today as well (mostly C#, Java, etc).
Return only words with an occurrence greater than X
Return only words with a length greater than Y
Ignore common terms like "and, is, the, etc"
Feel free to strip punctuation prior to processing (ie. "John's" becomes "John")
Return results in a collection/array
Extra Credit
Keep quoted statements together (i.e. "They were 'too good to be true' apparently"), where 'too good to be true' would be the actual statement.
Extra-Extra Credit
Can your script determine words that should be kept together based upon their frequency of being found together? This being done without knowing the words beforehand. Example:
*"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has led to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*
Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?
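One way to approach the "kept together" idea is to count adjacent word pairs and keep the pairs that recur. The following is a minimal Python sketch under stated assumptions, not a definitive solution: the stop-word list, the sentence-splitting rule, and the names `STOP_WORDS` and `frequent_pairs` are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the question does not fix one.
STOP_WORDS = {"the", "a", "is", "to", "in", "on", "it", "and", "has",
              "be", "but", "our", "will", "when", "been"}

def frequent_pairs(text, min_count=2):
    """Return {"word word": count} for adjacent word pairs seen at least min_count times."""
    pairs = Counter()
    # Split into sentences first so that a pair never spans a sentence boundary.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for first, second in zip(words, words[1:]):
            if first not in STOP_WORDS and second not in STOP_WORDS:
                pairs[first + " " + second] += 1
    return {pair: n for pair, n in pairs.items() if n >= min_count}

sample = ("The fruit fly is a great thing when it comes to medical research. "
          "Much study has been done on the fruit fly in the past, and has led to "
          "many breakthroughs. In the future, the fruit fly will continue to be "
          "studied, but our methods may change.")
print(frequent_pairs(sample))  # {'fruit fly': 3}
```

On a larger text a plain count threshold would probably need to be combined with some measure of how often the two words appear apart, but that goes beyond this sketch.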
Source text: http://sampsonresume.com/labs/c.txt
Answer Format
It would be great to see the results of your code, output, in addition to how long the operation lasted.

GNU scripting
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr
Results:
7 be
6 to
[...]
1 2.
1 -
With occurrence greater than X:
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'
Return only words with a length greater than Y (put Y+1 dots in second grep):
sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c
Ignore common terms like "and, is, the, etc" (assuming that the common terms are in file 'ignored')
sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c
Feel free to strip punctuation prior to processing (ie. "John's" becomes "John"):
sed -e "s/[,.:\"']//g;s/ /\n/g" | grep -v '^ *$' | sort | uniq -c
Return results in a collection/array: it is already like an array for shell: first column is count, second is word.

Perl in only 43 characters.
perl -MYAML -anE'$_{$_}++for@F;say Dump\%_'
Here is an example of its use:
echo a a a b b c d e aa | perl -MYAML -anE'$_{$_}++for@F;say Dump \%_'
---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1
If you need to list only the lowercase versions, it requires two more characters.
perl -MYAML -anE'$_{lc$_}++for@F;say Dump\%_'
For it to work on the specified text requires 58 characters.
curl http://sampsonresume.com/labs/c.txt |
perl -MYAML -F'\W+' -anE'$_{lc$_}++for@F;END{say Dump\%_}'
real 0m0.679s
user 0m0.304s
sys 0m0.084s
Here is the last example expanded a bit.
#! perl
use 5.010;
use YAML;
while( my $line = <> ){
for my $elem ( split '\W+', $line ){
$_{ lc $elem }++
}
END{
say Dump \%_;
}
}

F#: 304 chars
let f =
let bad = Set.of_seq ["and";"is";"the";"of";"are";"by";"it"]
fun length occurrence msg ->
System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
|> Seq.countBy (fun a -> a)
|> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)

Ruby
When "minified", this implementation becomes 165 characters long. It uses Array#inject with a starting value (a Hash object with a default of 0) and loops through the matched words, tallying each one in the hash; the result is then filtered by the minimum frequency.
Note that I didn't count the size of the words to skip, that being an external constant. When the constant is counted too, the solution is 244 characters long.
Apostrophes and dashes aren't stripped, but included; their use modifies the word and therefore cannot be stripped simply without removal of all information beyond the symbol.
Implementation
CommonWords = %w(the a an but and is not or as of to in for by be may has can its it's)
def get_keywords(text, minFreq=0, minLen=2)
text.scan(/(?:\b)[a-z'-]{#{minLen},}(?=\b)/i).
inject(Hash.new(0)) do |result,w|
w.downcase!
result[w] += 1 unless CommonWords.include?(w)
result
end.select { |k,n| n >= minFreq }
end
Test Rig
require 'net/http'
keywords = get_keywords(Net::HTTP.get('www.sampsonresume.com','/labs/c.txt'), 3)
keywords.sort.each { |name,count| puts "#{name} x #{count} times" }
Test Results
code x 4 times
declarations x 4 times
each x 3 times
execution x 3 times
expression x 4 times
function x 5 times
keywords x 3 times
language x 3 times
languages x 3 times
new x 3 times
operators x 4 times
programming x 3 times
statement x 7 times
statements x 4 times
such x 3 times
types x 3 times
variables x 3 times
which x 4 times

C# 3.0 (with LINQ)
Here's my solution. It makes use of some pretty nice features of LINQ/extension methods to keep the code short.
public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
{
var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
"for", "by", "an", "be", "may", "has", "can", "its"};
var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');
var occurrences = words.Distinct().Except(commonWords).Select(w =>
new { Word = w, Count = words.Count(s => s == w) });
return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
.ToDictionary(wo => wo.Word, wo => wo.Count);
}
This is, however, far from the most efficient method: it is O(n^2) in the number of words rather than O(n), which I believe is optimal in this case. I'll see if I can create a slightly longer method that is more efficient.
Here are the results of the function run on the sample text (min occurrences: 3, min length: 2).
3 x such
4 x code
4 x which
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
7 x statement
3 x language
3 x expression
3 x execution
3 x programming
4 x operators
3 x variables
And my test program:
static void Main(string[] args)
{
string sampleText;
using (var client = new WebClient())
sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
var keywords = GetKeywords(sampleText, 3, 2);
foreach (var entry in keywords)
Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
Console.ReadKey(true);
}
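The O(n) alternative mentioned in this answer amounts to tallying every word in one pass through a hash table instead of re-scanning the word list for each distinct word. Here is a rough sketch of that idea in Python rather than C#; the function name mirrors the C# one, but this is an illustration under that assumption, not the author's follow-up code. It splits on any whitespace, whereas the C# version splits on single spaces only.

```python
import re
from collections import Counter

COMMON_WORDS = {"and", "is", "the", "as", "of", "to", "or", "in",
                "for", "by", "an", "be", "may", "has", "can", "its"}

def get_keywords(text, min_count, min_length):
    # Same cleanup idea as the C# version: lower-case, then strip some punctuation.
    cleaned = re.sub(r"[,.?/;:()]", "", text.lower())
    # Counter tallies every word in a single pass over the list.
    counts = Counter(cleaned.split())
    return {word: n for word, n in counts.items()
            if n >= min_count and len(word) >= min_length
            and word not in COMMON_WORDS}

print(get_keywords("The cat and the hat. The cat sat; a cat!", 2, 2))  # {'cat': 2}
```

Note that "cat!" is counted separately from "cat" because "!" is not in the stripped character set, just as in the C# regular expression.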

#! perl
use strict;
use warnings;
my %words;    # word => number of occurrences
while (<>) {
    for my $word (split) {
        $words{$word}++;
    }
}
for my $word (keys %words) {
    print "$word occurred $words{$word} times.\n";
}
That's the simple form. If you want sorting, filtering, etc.:
while (<>) {
    for my $word (split) {
        $words{$word}++;
    }
}
for my $word (keys %words) {
    if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
        print "$word occurred $words{$word} times.\n";
    }
}
You can also sort the output pretty easily:
...
for my $word (keys %words) {
    if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
        push @output, "$word occurred $words{$word} times.\n";
    }
}
my $re = qr/occurred (\d+) /;
print sort {
    my ($x) = $a =~ $re;    # list context captures the count
    my ($y) = $b =~ $re;
    $x <=> $y
} @output;
A true Perl hacker will easily get these on one or two lines each, but I went for readability.
Edit: this is how I would rewrite this last example
...
for my $word (
sort { $words{$a} <=> $words{$b} } keys %words
){
next unless length($word) >= $MINLEN;
next unless $words{$word} >= $MIN_OCCURRENCE; # the sort is ascending, so "last" would stop at the first rare word
print "$word occurred $words{$word} times.\n";
}
Or if I needed it to run faster I might even write it like this:
for my $word_data (
sort {
$a->[1] <=> $b->[1] # numerical sort on count
} grep {
# remove values that are out of bounds
length($_->[0]) >= $MINLEN && # word length
$_->[1] >= $MIN_OCCURRENCE # count
} map {
# [ word, count ]
[ $_, $words{$_} ]
} keys %words
){
my( $word, $count ) = @$word_data;
print "$word occurred $count times.";
}
It uses map for efficiency, grep to remove extra elements, and sort to do the sorting, of course. (It does so in that order.)
This is a slight variant of the Schwartzian transform.
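The map, grep, sort pipeline above can be read as decorate, filter, sort, undecorate. The Python sketch below walks through the same stages; the function name `report` and its parameter names are hypothetical, chosen only to echo `$MINLEN` and `$MIN_OCCURRENCE`.

```python
def report(word_counts, min_len, min_occurrence):
    # Decorate: pair each word with its count once, so the table is read once per word.
    pairs = [(word, count) for word, count in word_counts.items()]
    # Filter: drop the pairs that are out of bounds.
    kept = [(w, c) for w, c in pairs if len(w) >= min_len and c >= min_occurrence]
    # Sort: numeric sort on the count only.
    kept.sort(key=lambda pair: pair[1])
    # Undecorate: turn each pair into an output line.
    return ["%s occurred %d times." % (w, c) for w, c in kept]

for line in report({"statement": 7, "be": 7, "code": 4, "goto": 1}, 3, 2):
    print(line)
```

With these inputs, "be" is dropped for its length and "goto" for its count, leaving "code" ahead of "statement".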

Another Python solution, at 247 chars. The actual code is a single, highly dense Python line of 134 chars that computes the whole thing in a single expression.
x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
from itertools import groupby as gb
d=dict((w,l)for w,l in((w,len(list(g)))for w,g in
gb(sorted(open("c.txt").read().lower().split())))
if l>x and len(w)>y and w not in W)
A much longer version with plenty of comments for your reading pleasure:
# High and low count boundaries.
x = 3
y = 2
# Common words string split into a list by spaces.
Words = "and is the as of to or in for by an be may has can its".split()
# A special function that groups equal adjacent strings in a list into
# (string, grouper) pairs. Grouper is a generator of occurrences (see below).
from itertools import groupby
# Reads the entire file, converts it to lower case and splits on whitespace
# to create a list of words
sortedWords = sorted(open("c.txt").read().lower().split())
# Using the groupby function, groups similar words together.
# Since grouper is a generator of occurrences we need to use len(list(grouper))
# to get the word count by first converting the generator to a list and then
# getting the length of the list.
wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))
# Filters the words by number of occurrences and common words using yet another
# generator expression.
filteredWordCounts = ((word, count) for word, count in wordCounts if word not in Words and count > x and len(word) > y)
# Creates a dictionary from the list of tuples.
result = dict(filteredWordCounts)
print result
The main trick here is using the itertools.groupby function to count the occurrences on a sorted list. Don't know if it really saves characters, but it does allow all the processing to happen in a single expression.
Results:
{'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}
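The groupby trick relies on the input being sorted, because `itertools.groupby` only merges equal items that sit next to each other. A small self-contained sketch (written for Python 3, with the hypothetical helper name `count_with_groupby`) makes that dependency visible:

```python
from itertools import groupby

def count_with_groupby(words):
    # groupby only merges *adjacent* equal items, hence the sort.
    return {word: sum(1 for _ in group) for word, group in groupby(sorted(words))}

words = "b a b c a b".split()
print(count_with_groupby(words))           # {'a': 2, 'b': 3, 'c': 1}
# Without sorting, the same word shows up as several separate groups:
print([word for word, _ in groupby(words)])  # ['b', 'a', 'b', 'c', 'a', 'b']
```

`sum(1 for _ in group)` counts the occurrences without building the intermediate list that `len(list(group))` creates.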

C# code:
IEnumerable<KeyValuePair<String, Int32>> ProcessText(String text, int X, int Y)
{
// common words, that will be ignored
var exclude = new string[] { "and", "is", "the", "as", "of", "to", "or", "in", "for", "by", "an", "be", "may", "has", "can", "its" }.ToDictionary(word => word);
// regular expression to find quoted text
var regex = new Regex("\"[^\"]+\"", RegexOptions.Compiled);
return
// remove quoted text (it will be processed later)
regex.Replace(text, "")
// remove case dependency
.ToLower()
// split text by all these chars
.Split(".,'\\/[]{}()`~@#$%^&*-=+?!;:<>| \n\r".ToCharArray())
// add quoted text
.Concat(regex.Matches(text).Cast<Match>().Select(match => match.Value))
// group words by the word and count them
.GroupBy(word => word, (word, words) => new KeyValuePair<String, Int32>(word, words.Count()))
// apply filter(min word count and word length) and remove common words
.Where(pair => pair.Value >= X && pair.Key.Length >= Y && !exclude.ContainsKey(pair.Key));
}
Output for ProcessText(text, 3, 2) call:
3 x languages
3 x such
4 x code
4 x which
3 x based
3 x each
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
3 x variables
7 x statement
4 x expression
3 x execution
3 x programming
3 x operators

In C#:
Use LINQ, specifically groupby, then filter by group count, and return a flattened (selectmany) list.
Use LINQ, filter by length.
Use LINQ, filter with 'badwords'.Contains.

REBOL
Verbose, perhaps, so definitely not a winner, but gets the job done.
min-length: 0
min-count: 0
common-words: [ "a" "an" "as" "and" "are" "by" "for" "from" "in" "is" "it" "its" "the" "of" "or" "to" "until" ]
add-word: func [
word [string!]
/local
count
letter
non-letter
temp
rules
match
][
; Strip out punctuation
temp: copy {}
letter: charset [ #"a" - #"z" #"A" - #"Z" #" " ]
non-letter: complement letter
rules: [
some [
copy match letter (append temp match)
|
non-letter
]
]
parse/all word rules
word: temp
; If we end up with nothing, bail
if 0 == length? word [
exit
]
; Check length
if min-length > length? word [
exit
]
; Ignore common words
ignore:
if find common-words word [
exit
]
; OK, its good. Add it.
either found? count: select words word [
words/(word): count + 1
][
repend words [word 1]
]
]
rules: [
some [
{"}
copy word to {"} (add-word word)
{"}
|
copy word to { } (add-word word)
{ }
]
end
]
words: copy []
parse/all read %c.txt rules
result: copy []
foreach word words [
if string? word [
count: words/:word
if count >= min-count [
append result word
]
]
]
sort result
foreach word result [ print word ]
The output is:
act
actions
all
allows
also
any
appear
arbitrary
arguments
assign
assigned
based
be
because
been
before
below
between
braces
branches
break
builtin
but
C
C like any other language has its blemishes Some of the operators have the wrong precedence some parts of the syntax could be better
call
called
calls
can
care
case
char
code
columnbased
comma
Comments
common
compiler
conditional
consisting
contain
contains
continue
control
controlflow
criticized
Cs
curly brackets
declarations
define
definitions
degree
delimiters
designated
directly
dowhile
each
effect
effects
either
enclosed
enclosing
end
entry
enum
evaluated
evaluation
evaluations
even
example
executed
execution
exert
expression
expressionExpressions
expressions
familiarity
file
followed
following
format
FORTRAN
freeform
function
functions
goto
has
high
However
identified
ifelse
imperative
include
including
initialization
innermost
int
integer
interleaved
Introduction
iterative
Kernighan
keywords
label
language
languages
languagesAlthough
leave
limit
lineEach
loop
looping
many
may
mimicked
modify
more
most
name
needed
new
next
nonstructured
normal
object
obtain
occur
often
omitted
on
operands
operator
operators
optimization
order
other
perhaps
permits
points
programmers
programming
provides
rather
reinitialization
reliable
requires
reserve
reserved
restrictions
results
return
Ritchie
say
scope
Sections
see
selects
semicolon
separate
sequence
sequence point
sequential
several
side
single
skip
sometimes
source
specify
statement
statements
storage
struct
Structured
structuresAs
such
supported
switch
syntax
testing
textlinebased
than
There
This
turn
type
types
union
Unlike
unspecified
use
used
uses
using
usually
value
values
variable
variables
variety
which
while
whitespace
widespread
will
within
writing

Python (258 chars as is, including 66 chars for first line and 30 chars for punctuation removal) :
W="and is the as of to or in for by an be may has can its".split()
x=3;y=2;d={}
for l in open('c.txt') :
    for w in l.lower().translate(None,',.;\'"!()[]{}').split() :
        if w not in W: d[w]=d.get(w,0)+1
for w,n in d.items() :
    if n>y and len(w)>x : print n,w
output :
4 code
3 keywords
3 languages
3 execution
3 each
3 language
4 expression
4 statements
3 variables
7 statement
5 function
4 operators
4 declarations
3 programming
4 which
3 such
3 types

Here is my variant, in PHP:
$str = implode(file('c.txt'));
$tok = strtok($str, " .,;()\r\n\t");
$splitters = '\s.,\(\);?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );
foreach($array as $key) {
$res[$key] = $res[$key]+1;
}
$splitters = '\s.,\(\)\{\};?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );
foreach($array as $key) {
$res[$key] = $res[$key]+1;
}
unset($res['the']);
unset($res['and']);
unset($res['to']);
unset($res['of']);
unset($res['by']);
unset($res['a']);
unset($res['as']);
unset($res['is']);
unset($res['in']);
unset($res['']);
arsort($res);
//var_dump($res); // concordance
foreach ($res AS $word => $rarity)
echo $word . ' <b>x</b> ' . $rarity . '<br/>';
foreach ($array as $word) { // words longer than n (=5)
// if(strlen($word) > 5)echo $word.'<br/>';
}
And output:
statement x 7
be x 7
C x 5
may x 5
for x 5
or x 5
The x 5
as x 5
expression x 4
statements x 4
code x 4
function x 4
which x 4
an x 4
declarations x 3
new x 3
execution x 3
types x 3
such x 3
variables x 3
can x 3
languages x 3
operators x 3
end x 2
programming x 2
evaluated x 2
functions x 2
definitions x 2
keywords x 2
followed x 2
contain x 2
several x 2
side x 2
most x 2
has x 2
its x 2
called x 2
specify x 2
reinitialization x 2
use x 2
either x 2
each x 2
all x 2
built-in x 2
source x 2
are x 2
storage x 2
than x 2
effects x 1
including x 1
arguments x 1
order x 1
even x 1
unspecified x 1
evaluations x 1
operands x 1
interleaved x 1
However x 1
value x 1
branches x 1
goto x 1
directly x 1
designated x 1
label x 1
non-structured x 1
also x 1
enclosing x 1
innermost x 1
loop x 1
skip x 1
There x 1
within x 1
switch x 1
Expressions x 1
integer x 1
variety x 1
see x 1
below x 1
will x 1
on x 1
selects x 1
case x 1
executed x 1
based x 1
calls x 1
from x 1
because x 1
many x 1
widespread x 1
familiarity x 1
C's x 1
mimicked x 1
Although x 1
reliable x 1
obtain x 1
results x 1
needed x 1
other x 1
syntax x 1
often x 1
Introduction x 1
say x 1
Programming x 1
Language x 1
C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better. x 1
Ritchie x 1
Kernighan x 1
been x 1
criticized x 1
For x 1
example x 1
care x 1
more x 1
leave x 1
return x 1
call x 1
&& x 1
|| x 1
entry x 1
include x 1
next x 1
before x 1
sequence point x 1
sequence x 1
points x 1
comma x 1
operator x 1
but x 1
compiler x 1
requires x 1
programmers x 1
exert x 1
optimization x 1
object x 1
This x 1
permits x 1
high x 1
degree x 1
occur x 1
Structured x 1
using x 1
struct x 1
union x 1
enum x 1
define x 1
Declarations x 1
file x 1
contains x 1
Function x 1
turn x 1
assign x 1
perhaps x 1
Keywords x 1
char x 1
int x 1
Sections x 1
name x 1
variable x 1
reserve x 1
usually x 1
writing x 1
type x 1
Each x 1
line x 1
format x 1
rather x 1
column-based x 1
text-line-based x 1
whitespace x 1
arbitrary x 1
FORTRAN x 1
77 x 1
free-form x 1
allows x 1
restrictions x 1
Comments x 1
C99 x 1
following x 1
// x 1
until x 1
*/ x 1
/* x 1
appear x 1
between x 1
delimiters x 1
enclosed x 1
braces x 1
supported x 1
if x 1
-else x 1
conditional x 1
Unlike x 1
reserved x 1
sequential x 1
provides x 1
control-flow x 1
identified x 1
do-while x 1
while x 1
any x 1
omitted x 1
break x 1
continue x 1
expressions x 1
testing x 1
iterative x 1
looping x 1
separate x 1
initialization x 1
normal x 1
modify x 1
control x 1
structures x 1
As x 1
imperative x 1
single x 1
act x 1
sometimes x 1
curly brackets x 1
limit x 1
scope x 1
language x 1
uses x 1
evaluation x 1
assigned x 1
values x 1
To x 1
effect x 1
semicolon x 1
actions x 1
common x 1
consisting x 1
used x 1
The var_dump statement simply displays the concordance. This variant preserves double-quoted expressions.
For the supplied file this code finishes in 0.047 seconds, though a larger file will consume a lot of memory (because of the file function).

This is not going to win any golfing awards but it does keep quoted phrases together and takes into account stop words (and leverages CPAN modules Lingua::StopWords and Text::ParseWords).
In addition, I use to_S from Lingua::EN::Inflect::Number to count only the singular forms of words.
You might also want to look at Lingua::CollinsParser.
#!/usr/bin/perl
use strict; use warnings;
use Lingua::EN::Inflect::Number qw( to_S );
use Lingua::StopWords qw( getStopWords );
use Text::ParseWords;
my $stop = getStopWords('en');
my %words;
while ( my $line = <> ) {
chomp $line;
next unless $line =~ /\S/;
next unless my @words = parse_line(' ', 1, $line);
++ $words{to_S $_} for
grep { length and not $stop->{$_} }
map { s!^[[:punct:]]+!!; s![[:punct:]]+\z!!; lc }
@words;
}
print "=== only words appearing 4 or more times ===\n";
print "$_ : $words{$_}\n" for sort {
$words{$b} <=> $words{$a}
} grep { $words{$_} > 3 } keys %words;
print "=== only words that are 12 characters or longer ===\n";
print "$_ : $words{$_}\n" for sort {
$words{$b} <=> $words{$a}
} grep { 11 < length } keys %words;
Output:
=== only words appearing 4 or more times ===
statement : 11
function : 7
expression : 6
may : 5
code : 4
variable : 4
operator : 4
declaration : 4
c : 4
type : 4
=== only words that are 12 characters or longer ===
reinitialization : 2
control-flow : 1
sequence point : 1
optimization : 1
curly brackets : 1
text-line-based : 1
non-structured : 1
column-based : 1
initialization : 1

Related

Ruby: find multiples of 3 and 5 up to n. Can't figure out what's wrong with my code. Advice based on my code please

I have been attempting the test below on codewars. I am relatively new to coding and will look for more appropriate solutions as well as asking you for feedback on my code. I have written the solution at the bottom and for the life of me cannot understand what is missing as the resultant figure is always 0. I'd very much appreciate feedback on my code for the problem and not just giving your best solution to the problem. Although both would be much appreciated. Thank you in advance!
The test posed is:
If we list all the natural numbers below 10 that are multiples of 3 or
5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Finish the solution so that it returns the sum of all the multiples of
3 or 5 below the number passed in. Additionally, if the number is
negative, return 0 (for languages that do have them).
Note: If the number is a multiple of both 3 and 5, only count it once.
My code is as follows:
def solution(number)
array = [1..number]
multiples = []
if number < 0
return 0
else
array.each { |x|
if x % 3 == 0 || x % 5 == 0
multiples << x
end
}
end
return multiples.sum
end
In a situation like this, when something in your code produces an unexpected result, you should debug it: run it line by line with the same argument and see what each variable holds. An interactive console for running code (like irb) is very helpful.
Moving to your example, let's start from the beginning:
number = 10
array = [1..number]
puts array.size # => 1 - wait what?
puts array[0].class # => Range
As you can see the array variable doesn't contain numbers but rather a Range object. After you finish filtering the array the result is an empty array that sums to 0.
Regardless of that, Ruby has a lot of built-in methods that can help you solve the same problem with less code, for example:
multiples_of_3_and_5 = array.select { |number| number % 3 == 0 || number % 5 == 0 }
When writing a multiline block of code, prefer the do, end syntax, for example:
array.each do |x|
if x % 3 == 0 || x % 5 == 0
multiples << x
end
end
I'm not suggesting that this is the best approach per se, but using your specific code, you could fix the MAIN problem by editing the first line of your code in one of 2 ways:
By either converting your range to an array. Something like this would do the trick:
array = (1..number).to_a
or by just using a range INSTEAD of an array like so:
range = 1..number
The latter solution inserted into your code might look like this:
number = 17
range = 1..number
multiples = []
if number < 0
return 0
else range.each{|x|
if x % 3 == 0 || x % 5 == 0
multiples << x
end
}
end
multiples.sum
#=> 60
The statement return followed by end suggests that you were writing a method, but the def statement is missing. I believe that should be
def tot_sum(number, array)
multiples = []
if number < 0
return 0
else array.each{|x|
if x % 3 == 0 || x % 5 == 0
multiples << x
end
}
end
return multiples.sum
end
As you point out, however, this double-counts numbers that are multiples of 15.
Let me suggest a more efficient way of writing that. First consider the sum of numbers that are multiples of 3 that do not exceed a given number n.
Suppose
n = 3
m = 16
then the total of numbers that are multiples of three that do not exceed 16 can be computed as follows:
3 * 1 + 3 * 2 + 3 * 3 + 3 * 4 + 3 * 5
= 3 * (1 + 2 + 3 + 4 + 5)
= 3 * 5 * (1 + 5)/2
= 45
This makes use of the fact that 5 * (1 + 5)/2 equals the sum of the arithmetic series 1 + 2 + 3 + 4 + 5.
We may write a helper method to compute this sum for any number n, with m being the number that multiples of n cannot exceed:
def tot_sum(n, m)
p = m/n
n * p * (1 + p)/2
end
For example,
tot_sum(3, 16)
#=> 45
We may now write a method that gives the desired result (remembering that we need to account for the fact that multiples of 15 are multiples of both 3 and 5):
def tot(m)
tot_sum(3, m) + tot_sum(5, m) - tot_sum(15, m)
end
tot( 9) #=> 23
tot( 16) #=> 60
tot(9999) #=> 23331668
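As a sanity check, the closed form can be compared against a straightforward loop. The sketch below ports `tot_sum` and `tot` to Python and adds a hypothetical `brute_force` helper for the comparison. Like the Ruby version, it counts multiples that do not exceed m and does not handle negative m, so the "below the number" wording and the negative-number guard from the question still have to be applied by the caller.

```python
def tot_sum(n, m):
    """Sum of the multiples of n that do not exceed m, via n * (1 + 2 + ... + p)."""
    p = m // n
    return n * p * (1 + p) // 2

def tot(m):
    # Inclusion-exclusion: multiples of 15 are counted by both the 3 and the 5 terms.
    return tot_sum(3, m) + tot_sum(5, m) - tot_sum(15, m)

def brute_force(m):
    # Direct summation, used only to check the formula.
    return sum(x for x in range(1, m + 1) if x % 3 == 0 or x % 5 == 0)

print(tot(9), tot(16))  # 23 60
print(all(tot(m) == brute_force(m) for m in range(500)))  # True
```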

Duplicate Strings with Ambiguity

I have a large (5-10 million) set of strings with the restricted alphabet of nucleotide symbols (A,T,C, and G) along with a wildcard symbol N. Each string has an integer associated with it.
I want to find all the unique strings and, for each, sum their integer values. The 'representative' string for a set of equal strings should be the one with the highest integer value. For example, given:
NTG 9
NAG 6
ANG 5
TTT 2
ATG 2
I want the output to be:
NTG 14
NAG 6
ATG 2
TTT 2
With a dataset of this size pairwise comparisons are not feasible. Any ideas?
I assumed that your target output wasn't accurate. It seems more appropriate to match "ATG" to "ANG" (which I have done) instead of matching "ANG" to "NTG" (your stated goal). This solution addresses your given sample set, but may not be helpful for your desired application given the significant difference in scale.
Code:
import re
test = """
NTG 9
NAG 6
ANG 5
TTT 2
ATG 2
"""
test = [x.split(" ") for x in test.upper().split("\n") if x != ""]
#print(test)
index = 0
while index < len(test):
    seq = test[index]
    seq_regex = seq[0].replace("N", ".")
    no_match_li = [x for x in test if len(re.findall(seq_regex, x[0])) == 0]
    match_li = [int(x[1]) for x in test if len(re.findall(seq_regex, x[0])) != 0]
    #print(no_match_li, match_li)
    test = [[seq[0], sum(match_li)]] + no_match_li
    index += 1
test = sorted(test, key=lambda x: x[1], reverse=True)
for seq in test:
    print(seq[0], seq[1])
Output:
NTG 11
NAG 6
ANG 5
TTT 2

Ruby while and if loop issue

x = 16
while x != 1 do
if x % 2 == 0
x = x / 2
print "#{x} "
end
break if x < 0
end
Hi, the result I get from above is 8 4 2 1 . Is there any way to remove the space at the end?
One of Ruby's main features is its beauty: you can shorten that loop to a nice one-liner when using an array:
x = 16
arr = []
arr.push(x /= 2) while x.even?
puts arr.join(' ')
# => "8 4 2 1"
* As sagarpandya82 suggested x.even? is the same as using x % 2 == 0, leading to even more readable code
Don't print the values inside the loop. Put them into a list (array), then, after the loop, join the array items using a space as glue.
x = 16
a = []
while x != 1 do
if x % 2 == 0
x = x / 2
a << x
end
break if x < 0
end
puts '<' + a.join(' ') + '>'
The output is:
<8 4 2 1>
As @Bathsheba notes in a comment, this solution uses extra memory (the array) to store the values, and the call to Array#join generates a string that doubles the memory requirements. This is not an issue for small lists like the one in the question, but it needs to be considered when the list becomes very large.
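If the memory cost matters, the values can be written out as they are produced, with the separator emitted before every item except the first. The sketch below shows this in Python rather than Ruby; `halvings` and `write_joined` are hypothetical helper names, and `halvings` stops at the first odd value instead of reproducing the question's loop exactly.

```python
import io
import sys

def halvings(x):
    # Yield x/2, x/4, ... for as long as the current value is even.
    while x != 0 and x % 2 == 0:
        x //= 2
        yield x

def write_joined(values, out=sys.stdout, sep=" "):
    # The separator goes *before* each item except the first, so nothing trails.
    first = True
    for value in values:
        if not first:
            out.write(sep)
        out.write(str(value))
        first = False
    out.write("\n")

buf = io.StringIO()
write_joined(halvings(16), buf)
print(repr(buf.getvalue()))  # '8 4 2 1\n'
```

Only one value is held at a time, so neither the intermediate array nor the joined string is built.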
loop.reduce([[], 16]) do |(acc, val), _|
break acc if val <= 1
acc << val / 2 if val.even?
[acc, val / 2]
end.join ' '
if x != 0
print " "
end
is one way, having dropped the suffixed space from the other print. I/O will always be the performance bottleneck; an extra if will have a negligible effect on performance, and the extra print will merely contribute to the output stream which is normally buffered.

Multiple ifs in Smalltalk

I'm really new to smalltalk and still trying to figure out the basic stuff. Below is a simple program I wrote.
It is supposed to print "a" if the number can be divided by 5, "b" if it can be divided by 3, and "ab" if it can be divided by 5 and 3. In any other case, the program just prints the number itself.
It certainly works like this, but I feel that the code isn't very pretty - I would like to avoid the third "if", but I'm really not sure how.
How would you refactor this?
1 to: 100 do: [ :i |
(i % 5 == 0)
ifTrue: [ Transcript show: 'a' ].
(i % 3 == 0)
ifTrue: [ Transcript show: 'b' ].
((i % 3 == 0) or: (i % 5 == 0))
ifFalse: [ Transcript show: i ].
Transcript cr.
].
Thanks in advance for your help!
Smells like a Fizz Buzz problem! :-)
One approach in Smalltalk (Pharo) I've seen that I like is to use a dictionary with the Fizz and/or Buzz words as values and the booleans for whether it's divisible by 3 and 5 as keys. Once you have that, you simply look up the value for each index between 1 and 100. Oh, and don't bother dividing and checking whether the remainder is zero yourself - it's Smalltalk, so a number should know whether it's divisible by another number.
| fizzbuzz |
fizzbuzz := Dictionary
with: #(true true)->'FizzBuzz'
with: #(true false)->'Fizz'
with: #(false true)->'Buzz'.
1 to: 100 do: [ :eachIndex |
Transcript
show: (fizzbuzz
at: {eachIndex isDivisibleBy: 3. eachIndex isDivisibleBy: 5}
ifAbsent: [ eachIndex ]);
cr]
Have a look at some of the other examples as well, sometimes the different approaches can be quite educational. I'll leave it to you to adapt the code to your 'a'/'b'/'ab' example.
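For comparison, here is the same lookup-table idea adapted to the 'a'/'b'/'ab' variant, sketched in Python rather than Smalltalk. The names `LABELS` and `label` are made up for the example; the keys are pairs of booleans, as in the dictionary above, but ordered (divisible by 5, divisible by 3) to match the question.

```python
# Keys are (divisible by 5, divisible by 3); anything missing falls back to the number.
LABELS = {
    (True, True): "ab",
    (True, False): "a",
    (False, True): "b",
}

def label(i):
    return LABELS.get((i % 5 == 0, i % 3 == 0), str(i))

for i in range(1, 101):
    print(label(i))
```

The three conditionals disappear because the table's default value plays the role of the final "otherwise print the number" branch.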
First of all, I would rewrite your version as:
1 to: 100 do: [:i |
i % 5 = 0 ifTrue: [Transcript show: 'a'].
i % 3 = 0 ifTrue: [Transcript show: 'b'].
(i % 3 = 0 or: [i % 5 = 0]) ifFalse: [Transcript show: i].
Transcript cr]
The changes are:
Use = instead of == (not a big deal)
Use or: [i % 5 = 0] with brackets
Another change you could introduce is
1 to: 100 do: [:i | | label |
i % 5 = 0 ifTrue: [label := 'a'].
i % 3 = 0 ifTrue: [label := 'b'].
label isNil ifTrue: [label := i].
Transcript show: label; cr]
Note that I'm not paying too much attention to the IFs but to the fact that Transcript show: appears three times in your code.
EDIT
Alas! My version above is not equivalent to yours: when the number is divisible by both 5 and 3 it prints only 'b' instead of 'ab'!
EDIT 2
Here is how to reproduce the behavior of the original code:
1 to: 100 do: [:i | | label |
label := ''.
i % 5 = 0 ifTrue: [label := 'a'].
i % 3 = 0 ifTrue: [label := label , 'b'].
label isEmpty ifTrue: [label := i].
Transcript show: label; cr]

How to refactor this in J?

Here is a different approach for the Project Euler #1 solution:
+/~.(3*i.>.1000%3),5*i.>.1000%5
How to refactor it?
[:+/@~.@,3 5([*i.@>.@%~)]
usage example:
f =: [:+/@~.@,3 5([*i.@>.@%~)]
f 1000
or
+/~.,3 5([*i.@>.@%~)1000
%~ = 4 : 'y % x'
i.@>.@%~ = 4 : 'i. >. y % x'
[*i.@>.@%~ = 4 : 'x * i. >. y % x'
3 5([*i.@>.@%~)] = 3 : '3 5 * i. >. y % 3 5'
[:+/@~.@,3 5([*i.@>.@%~)] = 3 : '+/ ~. , 3 5 * i. >. y % 3 5'
+/(#~ ( (0= 3| ]) +. (0 = 5 |]) )) 1+i.999
0 = ( 3 | ]) uses (twice) the trick of verb train (fork) with n u v (discussed at the end of http://www.jsoftware.com/help/learning/09.htm)
A different way of writing it:
+/(#~ ( ((0&=) @ (3&|)) +. ((0&=) @ (5&|)))) 1+i.999
Here is another approach, using a simple, generic verb
multiplesbelow =: 4 : 'I. 0 = x | i.y'
+/ ~. ,3 5 multiplesbelow"0 [ 1000
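For readers who do not speak J, the original expression (generate the multiples of 3 and of 5 below the limit, drop duplicates, sum) can be spelled out step by step. This is a loose Python paraphrase with a hypothetical function name `euler1`, not a line-for-line translation.

```python
def euler1(limit):
    multiples = []
    for n in (3, 5):
        # >. limit % n : how many multiples 0, n, 2n, ... lie below limit
        # (integer ceiling division, written without floats).
        count = -(-limit // n)
        # n * i. count : the multiples themselves
        multiples.extend(n * k for k in range(count))
    # ~. removes duplicates (the multiples of 15); +/ sums what is left.
    return sum(set(multiples))

print(euler1(10))    # 23
print(euler1(1000))  # 233168
```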

Resources