Programming concept - data-structures

I want to write a program that separates junk mail from regular mail using a point system.
The program has a set of words categorized as "junk words", and each junk word is assigned a number of points.
For every junk word found in a mail, the program should award the points that word is worth.
My pseudocode:
Read text from file
Look for "junk words"
For each junk word that comes up, add the points the word is worth.
If the total points for all junk words found are more than 10, print "SPAM" followed by a list of the junk words found in the file and their points.
Example (a textfile):
Hello!
Do you have trouble sleeping?
Do you need to rest?
Then dont hesitate call us for the absolute solution- without charge!
So when the program runs and analyzes the text above, the output should look like:
SPAM 14p
trouble 6p
charge 3p
solution 5p
What I was planning to write was something like this:
class junk(object):
    fil = open("filnamne.txt","r")
    junkwords = {"trouble":"6p","solution":"3p","virus":"4p"}
    words = junkwords
    if words in fil:
        print("SPAM")
    else:
        print("The file doesn't contain any junk")
So my problem now is: how do I give points for each word in my list that comes up in the file?
And how do I sum the total points so that if total_points is > 10 the program prints "SPAM",
followed by the list of the junk words found in the file and the total points for each word?

Here is a quick script that might get you close:
MAXPOINTS = 10
JUNKWORDS = {"trouble": 6, "solution": 5, "charge": 3, "virus": 7}
foundwords = {}  # junk word -> number of times it was seen
points = 0
with open("filnamne.txt", "r") as fil:
    for word in fil.read().split():
        if word in JUNKWORDS:
            if word not in foundwords:
                foundwords[word] = 0
            points += JUNKWORDS[word]
            foundwords[word] += 1
if points > MAXPOINTS:
    print("SPAM")
    for word in foundwords:
        # points per word = occurrences * value of the word
        print(word, foundwords[word] * JUNKWORDS[word])
else:
    print("The file doesn't contain any junk")
You may want to use .lower() on the words and make all your dictionary keys lowercase. Maybe also remove all non-alphanumeric characters.

Here's another approach:
from collections import Counter
word_points = {'trouble': 6, 'solution': 5, 'charge': 3, 'virus': 7}
words = []
with open('ham.txt') as f:
    for line in f:
        if line.strip():  # weed out empty lines
            for word in line.split():
                words.append(word)
count_of_words = Counter(words)
total_points = {}
for word in word_points:
    if word in count_of_words:
        total_points[word] = word_points[word] * count_of_words[word]
# Sum the points (the dictionary values), not the words (the keys)
if sum(total_points.values()) > 10:
    print('SPAM {}'.format(sum(total_points.values())))
    for item in total_points.items():
        print('Word: {} Points: {}'.format(*item))
There are some optimizations you can do, but it should give you an idea of the general logic. Counter is available from Python 2.7 and above.

I have assumed that each word has different points, so I have used a dictionary.
You need to count how many times each junk word occurs in the file.
You should store the points for each word as an integer, not as '6p' or '4p'.
So, try this:
def find_junk(filename):
    word_points = {"trouble": 6, "solution": 3, "charge": 2, "virus": 4}
    word_count = {word: 0 for word in word_points}
    count = 0
    found = []
    with open(filename) as f:
        for line in f:
            line = line.lower()
            for word in word_points:
                c = line.count(word)  # substring count, so "troubles" also matches
                if c > 0:
                    count += c * word_points[word]
                    if word not in found:  # list each word only once
                        found.append(word)
                    word_count[word] += c
    if count >= 10:
        print(' SPAM' * 4)
        for word in found:
            print('%10s%3s%3s' % (word, word_points[word], word_count[word]))
    else:
        print("Not spam")
find_junk('spam.txt')

Related

Ruby splitting string into different files

Here I've created an algorithm that extracts an array of the Federalist papers and splits them up saving them into separate files titled "Federalist No." followed by their respective numbers. Everything works perfectly and the files are being created beautifully; however, the only problem I run into now is that it fails to create the last output.
Maybe it's because I've been staring at this for too many hours but I'm at an impasse.
I've inserted the line puts fedSections.length to see what the output is.
Using a smaller version of the compilation of the Fed papers for testing, the terminal output is 3... it creates "Federalist No. 0" a blank document to take into account empty space and "Federalist No. 1" with the first federalist paper. No "Federalist No. 2."
Any thoughts?
# Create new string to add array l to
fedString = " "
for f in 0...l.length-1
  fedString += l[f] + ''
end
# Create variables applied to new files
Federalist_No= "Federalist No."
a = "0"
b = "FEDERALIST No."
fedSections = Array.new() # New array to insert Federalist paper to
fedSections = fedString.split("FEDERALIST No.") # Split string into elements of the array at each change in Federalist paper
puts fedSections.length
# Split gives empty string, off by one
for k in 0...fedSections.length-1 # Use of loop to write each Fed paper to its own file
  new_text = File.open(Federalist_No + a + ".txt", "w") # Open said file with write capabilities
  new_text.puts(b+a) # Write the "FEDERALIST No" and the number from "a"
  new_text.puts fedSections[k] # Write contents of string (section of paper) to a file
  new_text.close()
  a = a.to_i + 1 # Increment "a" by one to accommodate consecutive papers
  a = a.to_s # Restore to string
end
The error is in your for loop:
for k in 0...fedSections.length-1
What you actually want is
for k in 0..fedSections.length-1
The three-dot range ... excludes its end value, so the loop stops one element short of the last one. The two-dot range .. includes it.
But as screenmutt said, it is more idiomatic Ruby to use an each loop:
fedSections.each do |section|
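A minimal sketch of the difference, using a made-up three-element array in place of the real fedSections (the sample strings and the variable names sections, exclusive, inclusive and numbered are illustrative only, not from the original program):

```ruby
# Hypothetical stand-in for fedSections: split leaves an empty first element.
sections = ["", "text of paper 1", "text of paper 2"]

# Three dots exclude the end value, so the last index is never visited.
exclusive = (0...sections.length - 1).to_a

# Two dots include the end value, so every index is visited.
inclusive = (0..sections.length - 1).to_a

# each_with_index avoids the index arithmetic altogether.
numbered = []
sections.each_with_index do |section, number|
  numbered << "FEDERALIST No. #{number}: #{section}"
end

p exclusive
p inclusive
puts numbered
```

With each_with_index there is no range to get wrong, which is probably why the each form is usually preferred.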

copy the lines of a file into hashmap in ruby

I have a file with multiple lines. Each line contains two words and a number, separated by commas, for example a, b, 1. It means that string A and string B have the key 1. I wrote the piece of code below:
File.open(ARGV[0], 'r') do |f1|
  while line = f1.gets
    puts line
  end
end
I'm looking for a way to split each line and store it so that the first two words are saved in the hashmap with the trailing number as their key.
Does this work for you?
hash = {}
File.readlines(ARGV[0]).each do |line|
  # chomp drops the trailing newline so the key is '1' and not "1\n"
  var = line.chomp.gsub(' ','').split(',')
  hash[var[2]] = var[0],var[1]
end
This would give:
hash['1'] = ['a','b']
I don't know whether you want to store the number as an integer or a string. If it's an integer you're looking for, just call var[2].to_i before storing.
I modified your code a little bit; I think it's shorter this way. If I'm wrong in any way, do let me know.
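A rough sketch of the integer-key variant, run against a couple of made-up lines instead of a real file (the lines array and the name pairs_by_number are assumptions for illustration):

```ruby
# Hypothetical sample input standing in for File.readlines(ARGV[0]).
lines = ["a, b, 1\n", "c, d, 2\n"]

pairs_by_number = {}
lines.each do |line|
  # strip removes the newline; the regexp splits on commas and eats the spaces around them.
  first, second, number = line.strip.split(/\s*,\s*/)
  pairs_by_number[number.to_i] = [first, second]
end

p pairs_by_number
```

Note that if two lines share the same number, the later line overwrites the earlier one, so a hash of arrays may be needed when keys can repeat.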

Find and print lines in a file exactly matching string or regexp (Ruby)

In ruby 1.9.3, I'm trying to write a program that will find all words with n number of characters taken from an arbitrary set of characters. So for instance, if I'm given the characters [ b, a, h, s, v, i, e, y, k, s, a ] and n = 5, I need to find all 5-letter words that can be made using only those characters. Using the 2of4brif.txt word list from http://wordlist.sourceforge.net/ (to include British words and spellings, too), I have attempted the following code:
a = %w[b a h s v i e y k s a]
a.permutation(5).map(&:join).each do |x|
  File.open('2of4brif.txt').each_line do |line|
    puts line if line.match(/^[#{x}]+$/)
  end
end
This does nothing (no error message, no output, as if frozen). I have also attempted variations based on the following threads:
What's the best way to search for a string in a file?
Ruby find string in file and print result
How to search for exact matching string in a text file using Ruby?
Finding lines in a text file matching a regular expression
Match a content with regexp in a file?
How to open a file and search for a word?
Every variation I have tried has resulted in either:
1) Freezing;
2) Printing all words from the list that contain the 5-character permutations (I assume that's what it's doing; I didn't go through and check all of the thousands of printed words); or
3) Printing all 5-character permutations found within words in the list (again, I assume that's what it's doing).
Again, I'm not looking for words that contain the 5-character permutations, I'm looking for 5-character permutations that are complete words in and of themselves, so a line in the text file should only be printed if it is a perfect match with a permutation.
What am I doing wrong? Thanks in advance!
You’re not really using regular expressions here. Your program is very inefficient, not only because you’re re-opening the file for each single permutation as has been pointed out (and there are 55k of them!); but above all because all you want to do is
/^[bahsvieyksa]{5}$/
for each line of the file.
I would thus suggest:
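File.open('2of4brif.txt').each_line do |line|
  puts line if line.match(/^[bahsvieyksa]{5}$/)
end
as a much more efficient alternative. Note that a character class lets a letter be used more often than it appears in your set, so add a count check if that matters.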
This works for me using the english.0 file on that page (sorry, I couldn't find the specific file you mentioned):
a = %w[b a h s v i e y k s a l d n]
dict = {}
a.permutation(5).each do |p|
  dict[p.join('')] = true
end
File.open('english.0').each_line do |line|
  # chomp! and downcase! return nil when nothing changes, so don't chain them
  line = line.chomp.downcase
  puts line if dict[line]
end
The structure should be pretty clear - I build the dictionary of permutations up front in one giant hash (you may need to revisit this depending on input sizes, but memory is cheap these days), and then I used the fact that the input was "one word per line" to simply key into that hash.
Also note, in my version, I read through the file only once. In yours you scan the file once per permutation, and there are thousands of permutations.
Simpler is to just count the occurrence of each char and compare:
a = %w[b a h s v i e y k s a l d n]
File.read('2of4brif.txt').split("\n").each do |line|
  puts line if line.size == 5 && line.chars.all?{|x| line.count(x) <= a.count(x)}
end
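As a small self-contained sketch of that counting idea, here it is run against a handful of made-up candidate words instead of the word list file (the helper name buildable? and the candidates array are assumptions for illustration):

```ruby
LETTERS = %w[b a h s v i e y k s a]

# True if word has the wanted length and uses no letter more often
# than it appears in the available letters.
def buildable?(word, letters, length)
  return false unless word.size == length
  word.chars.uniq.all? { |char| word.count(char) <= letters.count(char) }
end

# Hypothetical stand-in for the lines of the word list.
candidates = %w[shake bikes hives sassy abbey]
matches = candidates.select { |word| buildable?(word, LETTERS, 5) }

puts matches
```

Here "sassy" is rejected because it needs three s's and only two are available, and "abbey" because it needs two b's. This is the case a plain character-class regexp would let through.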
For me the following worked out
File.open('file.txt').each_line do |line|
  puts line if line[/<regexp>/]
end

How does this Ruby app know to select the middle third of sentences?

I am currently following Beginning Ruby by Peter Cooper and have put together my first app, a text analyzer. However, whilst I understand all of the concepts and the way in which they work, I can't for the life of me understand how the app knows to select the middle third of sentences sorted by length from this line:
ideal_sentances = sentences_sorted.slice(one_third, one_third + 1)
I have included the whole app for context. Any help is much appreciated, as so far everything else is making sense.
#analyzer.rb --Text Analyzer
stopwords = %w{the a by on for of are with just but and to the my I has some in do}
lines = File.readlines(ARGV[0])
line_count = lines.size
text = lines.join
#Count the characters
character_count = text.length
character_count_nospaces = text.gsub(/\s+/, '').length
#Count the words, sentances, and paragraphs
word_count = text.split.length
paragraph_count = text.split(/\n\n/).length
sentence_count = text.split(/\.|\?|!/).length
#Make a list of words in the text that aren't stop words,
#count them, and work out the percentage of non-stop words
#against all words
all_words = text.scan(/\w+/)
good_words = all_words.select {|word| !stopwords.include?(word)}
good_percentage = ((good_words.length.to_f / all_words.length.to_f)*100).to_i
#Summarize the text by cherry picking some choice sentances
sentances = text.gsub(/\s+/, ' ').strip.split(/\.|\?|!/)
sentances_sorted = sentences.sort_by { |sentence| sentance.length }
one_third = sentences_sorted.length / 3
ideal_sentances = sentences_sorted.slice(one_third, one_third + 1)
ideal_sentances = ideal_sentences.select{ |sentence| sentence =~ /is|are/ }
#Give analysis back to user
puts "#{line_count} lines"
puts "#{character_count} characters"
puts "#{character_count_nospaces} characters excluding spaces"
puts "#{word_count} words"
puts "#{paragraph_count} paragraphs"
puts "#{sentence_count} sentences"
puts "#{sentence_count / paragraph_count} sentences per paragraph (average)"
puts "#{word_count / sentence_count} words per sentence (average)"
puts "#{good_percentage}% of words are non-fluff words"
puts "Summary:\n\n" + ideal_sentences.join(". ")
puts "-- End of analysis."
Obviously I am a beginner so plain English would help enormously, cheers.
It gets a third of the number of sentences with one_third = sentences_sorted.length / 3. Then the line you posted, ideal_sentances = sentences_sorted.slice(one_third, one_third + 1), says "grab a slice of all the sentences, starting at the index equal to one third of the length and continuing for one third of the length + 1 elements".
Make sense?
If you look up the slice method in the Ruby API, it says this:
If passed two Fixnum objects, returns a substring starting at the
offset given by the first, and a length given by the second.
(That wording is from String#slice; Array#slice takes a start index and a length in the same way.)
This means that if you have a list of sentences broken into three pieces
ONE | TWO | THREE
slice(1/3, 1/3+1)
will skip to the point 1/3 of the way from the beginning
| TWO | THREE (this is what you are looking at now)
and then return the part that runs for 1/3+1 elements from where you are, which gives you
| TWO |
sentences is a list of all sentences. sentances_sorted is that list sorted by sentence length, so the middle third will be the sentences of roughly average length. slice() grabs that middle third of the list, starting from the position represented by one_third and taking one_third + 1 elements from that point.
Note that the correct spelling is 'sentence' and not 'sentance'. I mention this only because you have some code errors that result from spelling it inconsistently.
I was stuck on this when I first started too. In plain English, you have to realize that the slice method can take 2 parameters here.
The first is the starting index. The second is the length, i.e. how many elements slice takes.
So let's say you start off with 6 sentences.
one_third = 2
slice(one_third, one_third+1)
1/3 of 6 is 2.
1) The first argument means you start at index 2, which is the third element (indexes start at 0).
2) The second argument is 2 (6/3) + 1, so the slice is 3 elements long.
So it returns the elements at indexes 2 to 4.
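To see it concretely, here is a minimal sketch with six placeholder strings standing in for the sorted sentences (the values s0 to s5 are made up; the variable names follow the spelling used in the book):

```ruby
# Six placeholder "sentences", already sorted by length.
sentences_sorted = %w[s0 s1 s2 s3 s4 s5]

one_third = sentences_sorted.length / 3   # integer division

# slice(start, length): start at index one_third, take one_third + 1 elements.
ideal_sentences = sentences_sorted.slice(one_third, one_third + 1)

p one_third
p ideal_sentences
```

With six elements this starts at index 2 and takes three elements, so the shortest two and the longest one are left out.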

How can I do fuzzy substring matching in Ruby?

I found lots of links about fuzzy matching, comparing one string to another and seeing which gets the highest similarity score.
I have one very long string, which is a document, and a substring. The substring came from the original document, but has been converted several times, so weird artifacts might have been introduced, such as a space here, a dash there. The substring will match a section of the text in the original document 99% or more. I am not matching to see from which document this string is, I am trying to find the index in the document where the string starts.
If the string was identical because no random error was introduced, I would use document.index(substring), however this fails if there is even one character difference.
I thought the difference would be accounted for by removing all characters except a-z in both the string and the substring, compare, and then use the index I generated when compressing the string to translate the index in the compressed string to the index in the real document. This worked well where the difference was whitespace and punctuation, but as soon as one letter is different it failed.
The document is typically a few pages to a hundred pages, and the substring from a few sentences to a few pages.
You could try amatch. It's available as a ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks to have what you need. The homepage for amatch is: https://github.com/flori/amatch.
Just bored and messing around with the idea, a completely non-optimized and untested hack of a solution follows:
require 'amatch'
module FuzzyFinder
  # Yields (or collects) every word of input together with its start offset
  def scanner( input )
    out = [] unless block_given?
    pos = 0
    input.scan(/(\w+)(\W*)/) do |word, white|
      startpos = pos
      pos += word.length + white.length
      if block_given?
        yield startpos, word
      else
        out << [startpos, word]
      end
    end
    out
  end
  # Returns [best score, start offsets in doc with that score]
  def find( text, doc )
    index = scanner(doc)
    sstr = text.gsub(/\W/,'')
    levenshtein = Amatch::Levenshtein.new(sstr)
    minlen = sstr.length
    maxndx = index.length
    possibles = []
    minscore = minlen*2
    index.each_with_index do |x, i|
      spos = x[0]
      str = x[1]
      si = i
      # Append following words until the candidate is as long as the fragment
      while (str.length < minlen)
        i += 1
        break unless i < maxndx
        str += index[i][1]
      end
      str = str.slice(0,minlen) if (str.length > minlen)
      score = levenshtein.search(str)
      if score < minscore
        possibles = [spos]
        minscore = score
      elsif score == minscore
        possibles << spos
      end
    end
    [minscore, possibles]
  end
end
Obviously there are numerous improvements possible and probably necessary! A few off the top:
Process the document once and store the results, possibly in a database.
Determine a usable length of string for an initial check, process against that initial substring first before trying to match the entire fragment.
Following up on the previous, precalculate starting fragments of that length.
A simple one is fuzzy_match
require 'fuzzy_match'
FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus') #=> seamus
A more elaborate one (though you wouldn't guess it from this example) is levenshtein, which computes the number of differences.
require 'levenshtein'
Levenshtein.distance('test', 'test') # => 0
Levenshtein.distance('test', 'tent') # => 1
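For illustration, the kind of distance such gems compute can be sketched in plain Ruby. The edit_distance helper below is a hypothetical name, not part of either gem, and it is the textbook dynamic-programming version rather than any gem's actual implementation:

```ruby
# Number of single-character insertions, deletions and substitutions
# needed to turn string a into string b.
def edit_distance(a, b)
  previous = (0..b.length).to_a          # distances from "" to each prefix of b
  a.each_char.with_index(1) do |char_a, i|
    current = [i]                        # distance from a[0, i] to ""
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,                 # delete char_a
        current[j - 1] + 1,              # insert char_b
        previous[j - 1] + cost           # substitute, or keep if equal
      ].min
    end
    previous = current
  end
  previous.last
end

p edit_distance('test', 'test')
p edit_distance('test', 'tent')
```

This only keeps two rows of the table in memory, which matters when the strings are pages long, but it is still far slower than a C-backed gem.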
You should look at the StrikeAMatch implementation detailed here:
A better similarity ranking algorithm for variable length strings
Instead of relying on some kind of string distance (i.e. number of changes between two strings), this one looks at the character pairs patterns. The more character pairs occur in each string, the better the match. It has worked wonderfully for our application, where we search for mistyped/variable length headings in a plain text file.
There's also a gem which combines StrikeAMatch (an implementation of Dice's coefficient on character-level bigrams) and Levenshtein distance to find matches: https://github.com/seamusabshere/fuzzy_match
It depends on the artifacts that can end up in the substring. In the simpler case where they are not part of [a-z], you can split the substring into its words and then use Regexp#match on the document:
document = 'Ulputat non nullandigna tortor dolessi illam sectem laor acipsus.'
substr = "tortor - dolessi _%&# +illam"
re = Regexp.new(substr.split(/[^a-z]/i).select{|e| !e.empty?}.join(".*"))
md = document.match re
puts document[md.begin(0) ... md.end(0)]
# => tortor dolessi illam
(Here, as we do not set any parentheses in the Regexp, we use begin and end on the first (full match) element 0 of MatchData.)
If you are only interested in the start position, you can use =~ operator:
start_pos = document =~ re
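Another sketch for the same simple case, where the artifacts are only whitespace and punctuation: strip everything except letters from both strings, search in the stripped text, and map the hit back to a position in the original document. The method name locate_ignoring_noise is made up for this example, and it will not help once an actual letter differs:

```ruby
# Returns the index in document where fragment starts, comparing letters only,
# or nil if the letters of fragment do not occur in sequence.
def locate_ignoring_noise(document, fragment)
  compact = String.new   # document reduced to lowercase letters
  positions = []         # positions[i] = index in document of compact[i]
  document.each_char.with_index do |char, index|
    next unless char =~ /[a-z]/i
    compact << char.downcase
    positions << index
  end
  needle = fragment.downcase.gsub(/[^a-z]/, "")
  hit = compact.index(needle)
  hit ? positions[hit] : nil
end

document = 'Ulputat non nullandigna tortor dolessi illam sectem laor acipsus.'
substr = "tortor - dolessi _%&# +illam"
start_pos = locate_ignoring_noise(document, substr)
p start_pos
```

Unlike joining the words with ".*", this cannot match across unrelated text that happens to sit between the words.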
I have used none of them, but I found some libraries just by searching for 'diff' on rubygems.org. All of them can be installed with gem. You might want to try them. I am interested myself, so if you already know these or try them out, it would be helpful if you left a comment.
diff
diff-lcs
differ
difflcs
pretty_diff
diffy
kronk
khtmldiff
gdiff
ruby_diff
tdiff
diffrenderer
diffplex
dbdiff
diff_dirs
rsyncdiff
wdiff
diff4all
davidtrogers-htmldiff
edouard-htmldiff
diff2xml
dirdiff
rrdiff
nokogiri-diff
pretty-diff
easy_diff
smartdiff

Resources