How to calculate entropy for each class?

I'm working on a school assignment and have been stuck for three days now, so I'm hoping to get some help with this problem.
The code is commented, as you can see. My problem is that I am trying to calculate the entropy for each class, but I do not know how. The marked line in the loop is an attempt to calculate the probability and the entropy at once, but I have no idea whether it is correct. Any advice?
buys <- c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no")
credit <- c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent")
student <- c("no", "no", "no","no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes", "no")
income <- c("high", "high", "high", "medium", "low", "low", "low", "medium", "low", "medium", "medium", "medium", "high", "medium")
age <- c(25, 27, 35, 41, 48, 42, 36, 29, 26, 45, 23, 33, 37, 44) # we change the age from categorical to numeric
data <- data.frame(age, income, student, credit, buys) # create a data frame
info <- function(CLASS.FREQ){
  freq.class <- CLASS.FREQ
  info <- 0
  for(i in 1:length(freq.class)){
    if(freq.class[[i]] != 0){ # if the number of examples in class i is not 0
      entropy<- -sum(freq.class[i]/length(freq.class) * log2(freq.class[i]/freq.class))
      # This is my problem: I am trying to calculate the entropy for each class, but I do not know how.
      # The line above is an attempt to calculate the probability and the entropy at once, but I have no idea whether it is correct.
    }else{
      entropy <- 0 # if we face log(0), the entropy is set to 0
    }
    info <- info + entropy # sum up entropy from all classes
  }
  return(info)
}
buys.freq <- table(buys)
buys.freq
info.buys <- info(buys.freq) # the info for buys should come out as 0.940286
info.buys
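For reference, the entropy of a class distribution is the sum over all classes of -p * log2(p), where p is the class count divided by the total number of examples (not by the number of classes). The following is a minimal sketch of that calculation, written in Ruby rather than R; the helper name entropy is hypothetical and not part of the assignment code.

```ruby
# Hypothetical helper: Shannon entropy (in bits) of a list of class labels.
def entropy(labels)
  total = labels.size.to_f
  # Count how often each class occurs.
  counts = labels.each_with_object(Hash.new(0)) { |label, h| h[label] += 1 }
  # Classes with zero examples never appear in counts, so log2(0) cannot occur.
  counts.values.sum do |count|
    prob = count / total
    -prob * Math.log2(prob)
  end
end

buys = %w[no no yes yes yes no yes no yes yes yes yes yes no]
puts entropy(buys).round(6) # 0.940286, the value expected in the question
```

The same two steps (divide each count by the total, then sum -p * log2(p)) carry over directly to the R function.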


How to create a sub array of given array of binary numbers based on number of 1's in Ruby?

Example:
Here is binary numbers array:
a = [001, 010, 100, 011, 101, 110, 111, 1000, 1001, 1010]
I want output like below:
[ [ 001, 010, 100, 1000 ], [ 011, 101, 110, 1001, 1010 ], [ 111 ] ]
Can anybody help me achieve this in Ruby?
I'm going to assume you're working with strings ("001") and not decimal/octal literals (001). If that's not the case, I strongly suggest casting to strings to make things easier on you.
We can count the number of ones in a string x with x.count('1'). Then we can take a list of strings and organize it by this value with a.group_by(...). This gives a hash, so if you just want the grouped arrays (as your expected output suggests), you simply call values on it.
a.group_by { |x| x.count('1') }.values
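If the elements are integers rather than strings, one possible way to apply the same idea is to convert each number to its binary string first. The sketch below assumes the inputs are written as binary literals (0b...), which is an assumption and not something stated in the question.

```ruby
# Assumption: the inputs are integers written as binary literals.
a = [0b001, 0b010, 0b100, 0b011, 0b101, 0b110, 0b111, 0b1000, 0b1001, 0b1010]

# Convert each integer to its binary string, then group by the number of ones.
groups = a.group_by { |n| n.to_s(2).count('1') }.values

# Print each group as binary strings for readability.
groups.each { |group| p group.map { |n| n.to_s(2) } }
```

Note that Integer#to_s(2) drops leading zeros, so 0b001 is printed as "1".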
Using Enumerable#group_by, as @Silvio has done, seems the most direct way to solve this problem, but here are a couple of other approaches one could use.
a = "001, 010, 100, 011, 101, 110, 111, 1000, 1001, 1010".split(', ')
#=> ["001", "010", "100", "011", "101", "110", "111", "1000", "1001", "1010"]
Construct a hash whose keys, k, are counts of ones and whose values are arrays containing the elements of the original array whose count of ones equals k
a.each_with_object({}) { |s,h| (h[s.count('1')] ||= []) << s }.values
#=> [["001", "010", "100", "1000"], ["011", "101", "110", "1001", "1010"], ["111"]]
Note that values is applied to the hash returned by each_with_object, namely
{1=>["001", "010", "100", "1000"], 2=>["011", "101", "110", "1001", "1010"], 3=>["111"]}
Consider the expression, (h[s.count('1')] ||= []) << s. Let
cnt = s.count('1')
Then (h[cnt] ||= []) << s expands to the following when parsed.
(h[cnt] = h[cnt] || []) << s
If h does not have a key cnt, then h[cnt] on the right of the equality equals nil, so the expression reduces to
(h[cnt] = []) << s
so h[cnt] #=> [s]. On the other hand, if h does have a key cnt, h[cnt] equals an array, which is truthy, so we execute
h[cnt] << s
Note that in h[cnt] = h[cnt] || [], the method on the left of the equality is Hash#[]=, whereas the method on the right of the equality is Hash#[].
Sort then slice
a.sort_by { |s| s.count('1') }.slice_when { |s1,s2| s1.count('1') < s2.count('1') }.to_a
#=> [["001", "010", "100", "1000"], ["011", "101", "110", "1001", "1010"], ["111"]]

Algorithm for searching "unstructured" dataset (of parameters and values)

Given an unstructured dataset consisting of sets of various parameters with numeric quantities, what's an efficient and practical algorithm for searching? The parameters vary wildly, with no exhaustive list of parameters, and any parameter can be part of any set.
That was either a very good problem description, or just confusing. So let me try an example:
"dataset" : [
  {"a": 4, "b": 1},
  {"a": 4, "b": 1, "c": 0.5},
  {"a": 1, "b": 3, "c": 0.5, "x": 1},
  {"x": 3, "t": 0.01}
]
search input (to match/score against dataset):
q = {"a": 2, "b": 1}
I'm thinking a matching/scoring rule along the lines of:
For each "set", s, in the dataset, scan through the parameters of s. If q contains the same parameter (name/key), then let v be the quantity (value) of that parameter in s and let w be the corresponding value in q. This parameter is then scored as min(w/v, 1.0), so that a sub-score never exceeds 1.
Repeat for each parameter of s, producing an overall score (as the product of all the "sub scores").
So, q scores
2/4 * 1/1 = 0.5 against the two first sets, 0.33 against the third set, and 0 against the last one. I'm not sure how to handle parameters in s that are not in q, but maybe those could give some secondary score (for those "hits" where score > 0).
Any tips on what to search (google) for here, any well-suited algorithms on something like this?
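The scoring rule described above can be sketched as a brute-force scan. This is only an illustration under stated assumptions: the helper name score is hypothetical, each sub-score is capped with min(w/v, 1.0) because that is what the worked scores (0.5 and 0.33) imply, parameters of s that are missing from q are ignored, and a set with no parameter in common scores 0.

```ruby
# Hypothetical scoring helper; uses min(w/v, 1.0), as the worked scores imply.
def score(query, set)
  shared = set.keys & query.keys
  return 0.0 if shared.empty? # no parameter in common
  shared.reduce(1.0) do |product, key|
    product * [query[key].to_f / set[key], 1.0].min
  end
end

dataset = [
  { a: 4, b: 1 },
  { a: 4, b: 1, c: 0.5 },
  { a: 1, b: 3, c: 0.5, x: 1 },
  { x: 3, t: 0.01 }
]
q = { a: 2, b: 1 }

scores = dataset.map { |s| score(q, s).round(2) }
p scores # [0.5, 0.5, 0.33, 0.0]
```

This scans every set on every query. For a larger dataset, an index from parameter name to the sets containing it would let a search skip sets that share nothing with q. The sketch does not handle a stored value of 0.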

Summing and comparing arrays in MongoDB

I'm very new to MongoDB; so far I've only done simple things like storing and retrieving documents.
I have a collection of documents (thousands and growing), each with an embedded array of integers between 0 and 255 (the array can hold as many as 5000 integers).
Example Mongo Collection Data:
{
  "name": "item1",
  "values": [1, 93, 45, 67, 89, 1, 2, 32, 45]
},
{
  "name": "item2",
  "values": [1, 23, 45, 123, 1, 5, 89, 14, 22]
},
{
  "name": "item3",
  "values": [23, 1, 44, 78, 89, 22, 150, 23, 12]
},
{
  "name": "item4",
  "values": [90, 23, 11, 67, 29, 1, 2, 1, 45]
}
Comparison would be:
Pseudo code:
distance = 0
for a in passed_in_item
  for b in mongo_collection
    distance += a - b
  end
end
an example passed in array (same as the ones in the mongo document, they will always be the same length):
[1, 93, 45, 67, 89, 1, 2, 32, 45]
I'd like to pass in an array of integers as a query and difference it against the array in the document to find the one with the least difference. Is this the sort of thing map reduce is good at and how would I roughly go about it? An example would be great. Also eventually I'd like the passed in array to come from another document in Mongo in a different collection.
Thanks!
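To make the intended comparison concrete, here is a minimal client-side sketch. It does not use any MongoDB API: plain hashes stand in for documents that have already been fetched. It also assumes that "difference" means the sum of absolute element-wise differences, since signed differences would cancel each other out. The helper name distance is hypothetical.

```ruby
# Plain hashes stand in for documents already fetched from the collection.
items = [
  { name: 'item1', values: [1, 93, 45, 67, 89, 1, 2, 32, 45] },
  { name: 'item2', values: [1, 23, 45, 123, 1, 5, 89, 14, 22] },
  { name: 'item3', values: [23, 1, 44, 78, 89, 22, 150, 23, 12] },
  { name: 'item4', values: [90, 23, 11, 67, 29, 1, 2, 1, 45] }
]

# Assumption: "difference" is the sum of absolute element-wise differences.
def distance(xs, ys)
  xs.zip(ys).sum { |x, y| (x - y).abs }
end

query = [1, 93, 45, 67, 89, 1, 2, 32, 45]
nearest = items.min_by { |item| distance(query, item[:values]) }
puts nearest[:name] # item1 (its array is identical to the query, distance 0)
```

Whether the same calculation is better done inside the database (for example with map-reduce) is the open question here. The sketch only pins down what is being computed.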

Method for padding an array in Ruby

Here's what I have now and it is somewhat working:
def padding(a, b, c=nil)
  until a[b-1]
    a << c
  end
end
This is when it works:
a=[1,2,3]
padding(a,10,"YES")
=>[1, 2, 3, "YES", "YES", "YES", "YES", "YES", "YES", "YES"]
a=[1,2,3]
padding(a,10,1)
=>[1, 2, 3, 1, 1, 1, 1, 1, 1, 1]
But it crashes when I do not enter a value for "c"
a=[1,2,3]
padding(a,10)
Killed
How should I change this to avoid a crash?
Additionally, how would you suggest changing this method to use it as follows:
[1,2,3].padding(10)
=>[1,2,3,nil,nil,nil,nil,nil,nil,nil]
[1,2,3].padding(10, "YES")
=>[1, 2, 3, "YES", "YES", "YES", "YES", "YES", "YES", "YES"]
I've seen other padding methods on SO, but they don't seem to be working as intended by the authors. So, I decided to give making my own a shot.
Do you know the Array#fill method?
It does exactly what you are looking for. If it already exists, why write your own?
arup@linux-wzza:~> pry
[1] pry(main)> a=[1,2,3]
=> [1, 2, 3]
[2] pry(main)> a.fill('YES', 3...10)
=> [1, 2, 3, "YES", "YES", "YES", "YES", "YES", "YES", "YES"]
[3] pry(main)>
You can fill your array, whatever way you want. It is a cool implementation. Give it a try.
Read about it in your console:
arup@linux-wzza:~> ri Array#fill
= Array#fill
(from ruby site)
------------------------------------------------------------------------------
ary.fill(obj) -> ary
ary.fill(obj, start [, length]) -> ary
ary.fill(obj, range ) -> ary
ary.fill { |index| block } -> ary
ary.fill(start [, length] ) { |index| block } -> ary
ary.fill(range) { |index| block } -> ary
------------------------------------------------------------------------------
The first three forms set the selected elements of self (which may be the
entire array) to obj.
A start of nil is equivalent to zero.
A length of nil is equivalent to the length of the array.
The last three forms fill the array with the value of the given block, which
is passed the absolute index of each element to be filled.
Negative values of start count from the end of the array, where -1 is the last
element.
a = [ "a", "b", "c", "d" ]
a.fill("x") #=> ["x", "x", "x", "x"]
a.fill("z", 2, 2) #=> ["x", "x", "z", "z"]
a.fill("y", 0..1) #=> ["y", "y", "z", "z"]
a.fill { |i| i*i } #=> [0, 1, 4, 9]
a.fill(-2) { |i| i*i*i } #=> [0, 1, 8, 27]
It is killed because you are entering an infinite loop. until a[b-1] will never finish, because when you add nils to the array, you will get:
a == [1, 2, 3, nil, nil, nil, nil, nil, nil, nil]
after a few iterations, and a[b-1] will be nil, which is falsey. The until loop will never stop.
As for the second question, it is easy to extend the existing Array class:
class Array
  def padding(i, value=nil)
    (i - length).times { self << value }
    self
  end
end
The result is what you expected:
[1,2,3].padding(10)
#=> [1, 2, 3, nil, nil, nil, nil, nil, nil, nil]
[1,2,3].padding(10, "YES")
#=> [1, 2, 3, "YES", "YES", "YES", "YES", "YES", "YES", "YES"]
Note that the method above modifies the existing array (so by Ruby convention it should be called padding!):
a = [1,2,3]
#=> [1, 2, 3]
a.padding(10, "YES")
#=> [1, 2, 3, "YES", "YES", "YES", "YES", "YES", "YES", "YES"]
a
#=> [1, 2, 3, "YES", "YES", "YES", "YES", "YES", "YES", "YES"]
But of course you can easily create a version of the method that doesn't modify the array. I assumed you wanted to modify the array, because your original method did.
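A non-mutating variant could be sketched as follows. The helper name padded is hypothetical, and it is written as a plain method rather than an Array extension to keep the example self-contained.

```ruby
# Hypothetical non-mutating helper: returns a new, padded array.
def padded(array, length, value = nil)
  missing = [length - array.size, 0].max # never negative
  array + Array.new(missing, value)      # Array#+ builds a new array
end

a = [1, 2, 3]
b = padded(a, 6, "YES")
p a # [1, 2, 3]
p b # [1, 2, 3, "YES", "YES", "YES"]
```

Array.new(missing, value) stores the same object in every padded slot, which is fine for nil or numbers but worth keeping in mind if the padding value is later mutated.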
Arup has nailed it, but here's another way:
def padding(a,b,c)
  [*a, *[c]*b]
end
a=[1,2,3]
padding(a,5,"YES")
#=> [1, 2, 3, "YES", "YES", "YES", "YES", "YES"]
The problem is that nil is evaluated as false, so the condition in until a[b-1] never becomes truthy when a[b-1] contains nil, and you loop forever until you run out of memory.
It is better to do:
def padding(a, b, c=nil)
  until a.size >= b
    a << c
  end
end
EDIT
(yes, Arup's answer is pretty neat)
You can do this as a one-liner, which is a bit more compact...
def padding(a, b, c=nil)
  a << c until a.size >= b
end
To specifically implement your padding method on Array:
module Padding
  refine Array do
    def padding(new_length, element=nil)
      if self.size < new_length
        self.concat(Array.new(new_length - self.size, element))
      end
      self # return the array even when no padding was needed
    end
  end
end
using Padding
puts [1,2,3].padding(10).inspect
# => [1, 2, 3, nil, nil, nil, nil, nil, nil, nil]
puts [1,2,3].padding(10, "YES").inspect
# => [1, 2, 3, "YES", "YES", "YES", "YES", "YES", "YES", "YES"]
EDIT: Forgot about Array#fill. Arup's answer is cool (even if you need to say fill(3, 7) instead of fill(-1, 10), as the latter gives the wrong result). It would have been better to use it instead of concat(Array.new(...)). Eh well. :)

Using Enumerable#zip on an Array of Arrays

I am trying to use Enumerable#zip on an array of arrays in order to group the elements of the first nested array with the corresponding elements of each subsequent nested array. This is my array:
roster = [["Number", "Name", "Position", "Points per Game"],
["12","Joe Schmo","Center",[14, 32, 7, 0, 23] ],
["9", "Ms. Buckets ", "Point Guard", [19, 0, 11, 22, 0] ],
["31", "Harvey Kay", "Shooting Guard", [0, 30, 16, 0, 25] ],
["7", "Sally Talls", "Power Forward", [18, 29, 26, 31, 19] ],
["22", "MK DiBoux", "Small Forward", [11, 0, 23, 17, 0] ]]
I want to group "Number" with "12", "9", "31", "7", and "22", and then do the same for "Name", "Position", etc. using zip. The following gives me the output I want:
roster[0].zip(roster[1], roster[2], roster[3], roster[4], roster[5])
How can I reformat this so that if I added players to my roster, they would be automatically included in the zip without me having to manually type in roster[6], roster[7], etc. I've tried using ranges in a number of ways but nothing seems to have worked yet.
First extract the head and tail of the list (header and rows, respectively) using a splat, then zip them together:
header, *rows = roster
header.zip(*rows)
This is the same as using transpose on the original roster:
header, *rows = roster
zipped = header.zip(*rows)
roster.transpose == zipped #=> true
:zip.to_proc[*roster]
This is a bit more flexible than transpose:
:zip.to_proc[*[(0..2), [:a, :b, :c]]] #=> [[0, :a], [1, :b], [2, :c]]
p roster.transpose()
roster[0].zip(*(roster[1..-1]))
This works no matter how many rows are in the roster array.
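One practical difference between the zip and transpose approaches shows up when the rows do not all have the same length. The small example below uses made-up data (not the roster from the question) to illustrate it: zip fills the gaps with nil, while transpose raises an IndexError.

```ruby
rows = [[1, 2, 3], [:a, :b]] # the second row is shorter

head, *rest = rows
p head.zip(*rest) # [[1, :a], [2, :b], [3, nil]]

begin
  rows.transpose
rescue IndexError => e
  puts "transpose failed: #{e.message}"
end
```

For a roster where every row has the same number of columns, the two approaches give the same result.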
