Say I have an app that's trying to relate a string with an int. There are many strings and I want to keep a list of the top N that have occurred.
For example say the strings looked like this:
item 0 = "Foo"
item 1 = "Foo"
item 2 = "Boo"
item 3 = "Boo"
item 4 = "Bar"
item 5 = "Sar"
Say my cache has a cap of 3. Here's how I want it to behave:
item 0 = TryGet "Foo" - Add. "Foo" occurrences = 1
item 1 = TryGet "Foo" - return. "Foo" occurrences = 2
item 2 = TryGet "Boo" - Add. "Boo" occurrences = 1
item 3 = TryGet "Boo" - return. "Boo" occurrences = 2
item 4 = TryGet "Bar" - Add. "Bar" occurrences = 1
item 5 = TryGet "Sar" - At capacity. Remove elem with lowest occurrences, "Bar". Add "Sar"
So each current cache item gets a weight and on discovery of a new item when at capacity, we toss out the element with the lowest number of get occurrences. Is there a name for this kind of caching algorithm?
EDIT:
I was looking for Least-Frequently Used
https://en.wikipedia.org/wiki/Cache_replacement_policies#Least-Frequently_Used_.28LFU.29
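The walkthrough in the question can be sketched in a few lines of Python. This is only a minimal illustration of the LFU idea, not a tuned implementation: the class name LFUCache and the method try_get are made up for this example, the capacity is assumed to be at least 1, and eviction scans every entry (ties go to the entry that was added first).

```python
class LFUCache:
    """Minimal least-frequently-used cache sketch (names are hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed to be >= 1
        self.counts = {}          # key -> number of times it has been requested

    def try_get(self, key):
        """Return True if key was already cached, False if it was just added."""
        if key in self.counts:
            self.counts[key] += 1
            return True
        if len(self.counts) >= self.capacity:
            # Evict the entry with the fewest occurrences.
            # min() keeps the first minimum, so ties go to the oldest entry.
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
        self.counts[key] = 1
        return False


cache = LFUCache(3)
results = [cache.try_get(s) for s in ["Foo", "Foo", "Boo", "Boo", "Bar", "Sar"]]
```

After the six calls, "Bar" has been evicted in favour of "Sar", matching the behaviour described in the question. A production LFU cache would normally use a structure that avoids the linear scan on eviction.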
I am trying to remove duplicates from an array using a For loop and a conditional statement, but I am unable to create a new array without any duplicates. I have an .xls file containing country names with duplicates; my aim is to remove the duplicates and create a new array of unique country names.
For example:
strFilePath="D:\Country.xls"
Set objExcel = CreateObject("Excel.Application")
objExcel.Visible=True
Set objWorkbook = objExcel.Workbooks.Open (strFilePath)
Set objSheet=objExcel.Sheets("Country")
objExcel.DisplayAlerts = False
objExcel.AskToUpdateLinks = False
objExcel.AlertBeforeOverwriting = False
Dim A(100)
Dim B(100)
For i = 2 To 6 Step 1
k = i-2
A(k)=objSheet.Cells(i,1).Value
Next
B(0)=A(0)
For j = 0 To 4 Step 1
strIt=A(j)
For m = 1 To 4 Step 1
reslt = StrComp(A(m),strIt,1)
If(reslt = 1 Or reslt = -1) Then
c=1
B(c)=A(m)
c=c+1
End if
m=m+1
Next
Next
Two options, depending on your needs:
1) Use a hash table of the country names. When you insert a value into the hash table, check whether an identical value is already present. If it is, skip the new value and continue with the next one; otherwise insert it. At the end you will have your list of unique country names.
2) Sort the list of countries, then do a second pass that removes duplicate countries (since duplicates will now be grouped together).
The problem with both of these methods is that they don't preserve the original order unless you keep some sort of "original index" value and sort on that value once the duplicates are removed.
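To make the hash-table option concrete, here is a small Python sketch (Python rather than VBScript, purely for illustration). It sidesteps the ordering problem by appending each value to a result list the first time it is seen, so no separate "original index" is needed. The function name unique_in_order and the sample country list are assumptions made up for this example.

```python
def unique_in_order(values):
    """Return the distinct values, keeping the order of first appearance."""
    seen = set()   # hash-based membership check
    result = []
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


countries = ["India", "France", "India", "Japan", "France"]
unique_countries = unique_in_order(countries)
```

The same idea carries over to VBScript with a Scripting.Dictionary used as the "seen" set.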
Here's how I usually do it:
Dim uniqueentries()
ReDim uniqueentries(-1)
' Here you could go through your existing array and
' call the "GetUniqueEntries" sub on each entry, e.g.
For Each i In oldarray
    GetUniqueEntries i
Next
Sub GetUniqueEntries(newentry)
    Dim entry
    If UBound(uniqueentries) >= 0 Then ' Only check if uniqueentries contains any entries
        For Each entry In uniqueentries
            If newentry = entry Then Exit Sub ' The entry already exists in the array, so exit the sub
        Next
    End If
    ReDim Preserve uniqueentries(UBound(uniqueentries) + 1) ' Increase new array size
    uniqueentries(UBound(uniqueentries)) = newentry ' Add unique entry to new array
End Sub
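This can be done more simply by building a delimited string and using the Split function. Please check the solution below, and let me know if anything needs clarification.
Dim aDupl
Dim aNew, strNew, iCnt
aDupl = Array("A", "B", "A", "D", "C", "D")
strNew = ","
For iCnt = 0 To UBound(aDupl)
    ' Search for ",value," so that "A" does not match inside a longer entry such as "AB"
    If InStr(strNew, "," & aDupl(iCnt) & ",") = 0 Then
        strNew = strNew & aDupl(iCnt) & ","
    End If
Next
' Strip the leading and trailing commas so Split does not return empty entries
aNew = Split(Mid(strNew, 2, Len(strNew) - 2), ",")
For iCnt = 0 To UBound(aNew)
    WScript.Echo aNew(iCnt)
Next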
I want to make a program that sorts mail from junk mail using a point system.
For certain words in the mail,
I want the program to award points. Each word that I have categorized in my program as a "junk word" is assigned its own number of points.
My pseudocode:
Read text from file
Look for "junk words"
For each junk word that comes up, add the points that word is worth.
If the total points for all junk words reach 10, print "SPAM" followed by a list of the junk words found in the file and their points.
Example (a textfile):
Hello!
Do you have trouble sleeping?
Do you need to rest?
Then dont hesitate call us for the absolute solution- without charge!
So when the program runs and analyzes the text above, the output should look like this:
SPAM 14p
trouble 6p
charge 3p
solution 5p
So what I was planning to write was something like this:
class junk(object):
fil = open("filnamne.txt","r")
junkwords = {"trouble":"6p","solution":"3p","virus":"4p"}
words = junkwords
if words in fil:
print("SPAM")
else:
print("The file doesn't contain any junk")
So my problem now is: how do I award points for each word in my list that comes up in the file?
And how do I sum the total points so that if total_points is > 10 the program prints "SPAM",
followed by the list of the junk words that were found in the file and the total points for each word?
Here is a quick script that might get you close to there:
MAXPOINTS = 10
JUNKWORDS = {"trouble": 6, "solution": 5, "charge": 3, "virus": 7}
foundwords = {}
points = 0
with open("filnamne.txt", "r") as fil:
    for word in fil.read().split():
        if word in JUNKWORDS:
            if word not in foundwords:
                foundwords[word] = 0
            points += JUNKWORDS[word]
            foundwords[word] += 1
if points > MAXPOINTS:
    print("SPAM")
    for word in foundwords:
        # points per word = occurrences * value of the word
        print("%s %d" % (word, foundwords[word] * JUNKWORDS[word]))
else:
    print("The file doesn't contain any junk")
You may want to use .lower() on the words and make all your dictionary keys lowercase. Maybe also remove all non-alphanumeric characters.
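As a rough sketch of that normalization step: lowercase the text, then keep only runs of letters and digits. The helper name normalize_words is hypothetical, and the pattern assumes plain ASCII text; note that it also splits contractions such as "don't" into "don" and "t".

```python
import re


def normalize_words(text):
    """Lowercase the text and return its runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


words = normalize_words("Trouble sleeping? Call us - without CHARGE!")
```

Feeding these normalized words into the lookup means "Trouble" and "charge!" are matched against the lowercase dictionary keys.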
Here's another approach:
from collections import Counter
word_points = {'trouble': 6, 'solution': 5, 'charge': 3, 'virus': 7}
words = []
with open('ham.txt') as f:
    for line in f:
        if line.strip():  # weed out empty lines
            for word in line.split():
                words.append(word)
count_of_words = Counter(words)
total_points = {}
for word in word_points:
    if word in count_of_words:
        total_points[word] = word_points[word] * count_of_words[word]
# Sum the point values (not the dictionary keys) to get the overall score
total = sum(total_points.values())
if total > 10:
    print('SPAM {}'.format(total))
    for item in total_points.items():
        print('Word: {} Points: {}'.format(*item))
There are some optimizations you can do, but it should give you an idea of the general logic. Counter is available from Python 2.7 and above.
I have assumed that each word has different points, so I have used a dictionary.
You need to find the number of times each word in words occurs in the file.
You should store the points for each word as an integer, not as '6p' or '4p'.
So, try this:
def find_junk(filename):
    word_points = {"trouble": 6, "solution": 3, "charge": 2, "virus": 4}
    word_count = {word: 0 for word in word_points}
    count = 0
    found = []
    with open(filename) as f:
        for line in f:
            line = line.lower()
            for word in word_points:
                c = line.count(word)  # substring count, so "troubles" also matches
                if c > 0:
                    count += c * word_points[word]
                    if word not in found:  # list each junk word only once
                        found.append(word)
                    word_count[word] += c
    if count >= 10:
        print(' SPAM' * 4)
        for word in found:
            print('%10s%3s%3s' % (word, word_points[word], word_count[word]))
    else:
        print("Not spam")

find_junk('spam.txt')
I have some sequences in a string denoted by "#number" (/#\d/).
I want to remove any redundant sequences, where #2 is followed by #2.
I only want to remove a sequence if it is identical to the #number sequence directly before it in the text, so for #2lorem#2ipsum the 2nd #2 is removed, but for #2lorem#1ipsum#2dolor nothing is removed because #1 is between the two #2 sequences.
"#2randomtext#2randomtext#2randomtext#1bla#2bla2#2bla2"
becomes:
"#2randomtextrandomtextrandomtext#1bla#2bla2bla2"
"#2randomtext#2randomtext#2randomtext#1bla#2bla2#2bla2".gsub /(?<=(#\d))([^#]*)\1/,'\2'
=> "#2randomtextrandomtextrandomtext#1bla#2bla2bla2"
You can split it into tokens:
my_string = "#2randomtext#2randomtext#2randomtext#1bla#2bla2#2bla2"
tokens = my_string.scan /(#\d+)?((?:(?!#\d+).)*)/
#=> [["#2", "randomtext"], ["#2", "randomtext"], ["#2", "randomtext"], ["#1", "bla"], ["#2", "bla2"], ["#2", "bla2"]]
Then chunk, map and join:
tokens.chunk{|x| x[0].to_s}.map{|n, v| [n, v.map(&:last)]}.join
#=> "#2randomtextrandomtextrandomtext#1bla#2bla2bla2"
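The same tokenize-then-compare idea can be sketched in Python for comparison. This is an illustration only, not a translation of the Ruby above: the function name drop_repeated_markers is made up, and it relies on re.split keeping the captured markers in the result list at the odd indexes.

```python
import re


def drop_repeated_markers(text):
    """Remove a '#<digits>' marker when it repeats the marker before it."""
    parts = re.split(r"(#\d+)", text)  # captured markers land at odd indexes
    kept = []
    previous = None
    for index, part in enumerate(parts):
        if index % 2 == 1:        # this part is a marker
            if part == previous:
                continue          # same marker as the last one: drop it
            previous = part
        kept.append(part)
    return "".join(kept)


result = drop_repeated_markers("#2randomtext#2randomtext#2randomtext#1bla#2bla2#2bla2")
```

Because only the previous marker is remembered, "#2lorem#1ipsum#2dolor" comes back unchanged, as the question requires.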
my_string = "#2randomtext#2randomtext#2randomtext#1bla#2bla2#2bla2"
prev_sequence = nil
i = 0
# Use an explicit index because the string shrinks while we scan it
while i < my_string.length - 1
  if my_string[i] == '#'
    new_sequence = my_string[i, 2] # "#" plus the single digit after it
    if new_sequence == prev_sequence
      my_string.slice!(i, 2) # remove the repeat, then re-check the same index
      next
    end
    prev_sequence = new_sequence
  end
  i += 1
end
puts my_string
Easy... split your string into an array, and then compare the number that leads each entry with the one before it. If it's the same, remove it. The complicated part (though not that much) is that you can't remove entries from an array while looping through it... so what you need to do is make a recursive function... here's the pseudocode:
-= Global values =-
Declare StringArray and set it to OriginalString.SplitOn("#")
-= Method RemoveLeadingDuplicates =-
Declare Counter
Declare RemoveIndex
loop for each string in StringArray
    if previous lead == current lead
        Set RemoveIndex to Counter
        break from loop
    else
        previous lead = current lead
    Increase Counter By 1
end loop
if RemoveIndex is not null
    Remove the item at the specified index from the array
    Call RemoveLeadingDuplicates
Return
Based on the following data in a YAML file, is it possible to create a regular expression in Ruby which matches the respective Group and Item keys from a list?
Source Data
Groups:
  GroupA:
    - item 1
    - item 3
  Group B:
    - itemA
    - item 3
  C:
    - 1
    - item 3
Test String:
GroupA item 1
Group B itemA
c item 1
C 1
GroupA 1
Expected Match Groups
Match 1:
1. GroupA
2. item 1
Match 2:
1. Group B
2. itemA
Match 3:
1. C
2. 1
Thanks for any help!
Ian
==================================
Update: Following Tin Man's comment -
Here's some further background...
A class inside a plugin exists which contains a number of methods. Each method receives a string which is parsed to determine what action is performed. In some methods, the contents of the string are used in the subsequent actions - when this is required a regular expression is used to extract (or match) the relevant parts of the string. Unfortunately there is no control over the upstream code to alter this process.
In this case, the string is in the form "Group Item Status". However the group and item names are not necessarily single words and each group does not have to contain all items. e.g.
"Group A Item 1"
"c item 1"
"GroupA 1"
So, what's needed is a method of parsing the input string to get the respective Group and Item so that the correct values are passed to methods further down the line. Given that other comparable methods in the class use regular expressions, and there is a YAML file which contains the definitive list of group - item pairs, a regular expression was my first line of thought.
However, I am open to better approaches.
Many thanks
Ian
Why would you want to match anything in a YAML file? Load it into Ruby using the YAML parser, and search it, or modify in memory.
If you want to save the modified file, the YAML parser can emit a Ruby object as YAML, which you then save.
require 'yaml'
yaml = '
---
Groups:
  GroupA:
    - item 1
    - item 3
  Group B:
    - itemA
    - item 3
  C:
    - 1
    - item 3
'
yaml = YAML.load(yaml)
# => {"Groups"=>{"GroupA"=>["item 1", "item 3"], "Group B"=>["itemA", "item 3"], "C"=>[1, "item 3"]}}
yaml['Groups']['GroupA'].first
# => "item 1"
yaml['Groups']['Group B'][1]
# => "item 3"
yaml['Groups']['C'].last
# => "item 3"
Based on the above definitions, manipulating the data could be done like this:
yaml = YAML.load(yaml)
groups = yaml['Groups']
new_group = {
'groupa_first' => groups['GroupA'].first,
'groupb_second' => groups['Group B'][1],
'groupc_last' => groups['C'].last
}
yaml['New Group'] = new_group
puts yaml.to_yaml
Which outputs:
---
Groups:
  GroupA:
  - item 1
  - item 3
  Group B:
  - itemA
  - item 3
  C:
  - 1
  - item 3
New Group:
  groupa_first: item 1
  groupb_second: item 3
  groupc_last: item 3
There's a reason we have YAML parsers for all the different languages: they make it easy to load and use the data. Take advantage of that tool, use Ruby to modify the data, and, if needed, write it out again. It would have to be one huge YAML file before I'd even think of trying to modify it on disk, considering it's so easy to do in memory.
Now, the question becomes, how do you search the keys of a hash using a regex?
yaml['Groups'].select{ |k,v| k[/^Group/] }
# => {"GroupA"=>["item 1", "item 3"], "Group B"=>["itemA", "item 3"]}
Once you have the ones you want, you can easily modify their contents, substitute them back into the in-memory hash, and write it out.
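For the matching described in the question itself (pulling the Group and Item out of an input string), one possible approach is to build one pattern per group from the already-parsed data. The sketch below is in Python for illustration and makes several assumptions: the groups are hardcoded as a dict that mirrors what the YAML parser returns, the names GROUPS, compile_patterns and match_group_item are hypothetical, matching is case-sensitive, and group and item are separated by exactly one space.

```python
import re

# Mirrors the structure the YAML parser returns for the sample file
GROUPS = {
    "GroupA": ["item 1", "item 3"],
    "Group B": ["itemA", "item 3"],
    "C": [1, "item 3"],
}


def compile_patterns(groups):
    """Build one anchored regex per group from its own list of items."""
    patterns = []
    for group, items in groups.items():
        # Longest items first so a short item cannot shadow a longer one
        names = sorted((str(item) for item in items), key=len, reverse=True)
        item_alt = "|".join(re.escape(name) for name in names)
        # The item must be followed by whitespace or the end of the string
        patterns.append(re.compile(r"^(%s) (%s)(?=\s|$)" % (re.escape(group), item_alt)))
    return patterns


def match_group_item(line, patterns):
    """Return (group, item) for the first pattern that matches, else None."""
    for pattern in patterns:
        found = pattern.match(line)
        if found:
            return found.group(1), found.group(2)
    return None


patterns = compile_patterns(GROUPS)
tests = ["GroupA item 1", "Group B itemA", "c item 1", "C 1", "GroupA 1"]
matches = [match_group_item(line, patterns) for line in tests]
```

With the test strings from the question, only "GroupA item 1", "Group B itemA" and "C 1" produce a match, which lines up with the expected match groups. Because each pattern is anchored at the start only, a trailing status word after the item would not prevent a match.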
I am currently following Beginning Ruby by Peter Cooper and have put together my first app, a text analyzer. However, whilst I understand all of the concepts and the way in which they work, I can't for the life of me understand how the app knows to select the middle third of sentences sorted by length from this line:
ideal_sentances = sentences_sorted.slice(one_third, one_third + 1)
I have included the whole app for context. Any help is much appreciated, as so far everything else is making sense.
#analyzer.rb --Text Analyzer
stopwords = %w{the a by on for of are with just but and to the my I has some in do}
lines = File.readlines(ARGV[0])
line_count = lines.size
text = lines.join
#Count the characters
character_count = text.length
character_count_nospaces = text.gsub(/\s+/, '').length
#Count the words, sentances, and paragraphs
word_count = text.split.length
paragraph_count = text.split(/\n\n/).length
sentence_count = text.split(/\.|\?|!/).length
#Make a list of words in the text that aren't stop words,
#count them, and work out the percentage of non-stop words
#against all words
all_words = text.scan(/\w+/)
good_words = all_words.select {|word| !stopwords.include?(word)}
good_percentage = ((good_words.length.to_f / all_words.length.to_f)*100).to_i
#Summarize the text by cherry picking some choice sentances
sentances = text.gsub(/\s+/, ' ').strip.split(/\.|\?|!/)
sentances_sorted = sentences.sort_by { |sentence| sentance.length }
one_third = sentences_sorted.length / 3
ideal_sentances = sentences_sorted.slice(one_third, one_third + 1)
ideal_sentances = ideal_sentences.select{ |sentence| sentence =~ /is|are/ }
#Give analysis back to user
puts "#{line_count} lines"
puts "#{character_count} characters"
puts "#{character_count_nospaces} characters excluding spaces"
puts "#{word_count} words"
puts "#{paragraph_count} paragraphs"
puts "#{sentence_count} sentences"
puts "#{sentence_count / paragraph_count} sentences per paragraph (average)"
puts "#{word_count / sentence_count} words per sentence (average)"
puts "#{good_percentage}% of words are non-fluff words"
puts "Summary:\n\n" + ideal_sentences.join(". ")
puts "-- End of analysis."
Obviously I am a beginner so plain English would help enormously, cheers.
It takes a third of the number of sentences with one_third = sentences_sorted.length / 3. Then the line you posted, ideal_sentances = sentences_sorted.slice(one_third, one_third + 1), says "grab a slice of the sorted sentences starting at the index equal to one third of the count, and take one third of the count plus one elements".
Make sense?
If you look up the slice method in the Ruby API, it says this:
If passed two Fixnum objects, returns a substring starting at the
offset given by the first, and a length given by the second.
Array#slice takes the same start and length arguments, so if you picture the sorted list broken into three pieces
ONE | TWO | THREE
slice(1/3, 1/3+1)
will start at 1/3 from the beginning
| TWO | THREE (this is what you are looking at now)
then it takes 1/3+1 elements from where you are, which gives you
| TWO |
sentences is a list of all sentences. sentances_sorted is that list sorted by sentence length, so the middle third will be the sentences of roughly average length. slice() grabs that middle third of the list, starting from the position represented by one_third and taking one_third + 1 elements from that point.
Note that the correct spelling is 'sentence' and not 'sentance'. I mention this only because you have some code errors that result from spelling it inconsistently.
I was stuck on this when I first started too. In plain English, you have to realize that the slice method can take 2 parameters here.
The first is the index. The second is how long slice goes for.
So let's say you start off with 6 sentences.
one_third = 2
slice(one_third, one_third+1)
1/3 of 6 is 2.
1) here one_third is 2, so the slice starts at index 2, which is the third element (indexes start at 0)
2) then it takes 2 (6/3) + 1 elements, so a total of 3
so it returns the elements at indexes 2 to 4
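The start-and-length behaviour can be checked with a tiny Python sketch. Python lists use start:stop slices rather than start and length, so the helper below (the hypothetical name ruby_slice) converts between the two; it is a simplification, since Ruby returns nil rather than an empty list when the start index lies beyond the end of the array.

```python
def ruby_slice(items, start, length):
    """Mimic Ruby's slice(start, length) for in-range starts: take `length` items from `start`."""
    return items[start:start + length]


# Six placeholder sentences, already sorted by length
sentences_sorted = ["s0", "s1", "s2", "s3", "s4", "s5"]
one_third = len(sentences_sorted) // 3   # 6 // 3 == 2
middle = ruby_slice(sentences_sorted, one_third, one_third + 1)
```

Here middle holds the three elements at indexes 2, 3 and 4, which is the "middle third plus one" that the book's line selects.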