Huffman Code for the given frequency set - data-structures

Generate the Huffman code for the given set of frequencies:
{1, 1, 2, 3, 4, 8, 12, 21}
As I understand it, we generate the Huffman code for a given problem according to the frequency (the number of times a symbol occurs), so the table for the above set would be this:
[Image: probability table for the frequencies above]
Then I tried to build the frequency tree, but I am stuck. Would anyone mind helping me a little?
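Not an authoritative answer, but here is the standard greedy construction as a Python sketch for the frequencies above: put every symbol in a min-heap keyed by frequency, repeatedly merge the two smallest trees, then read codes off the root (left = 0, right = 1). The tuple-based tree representation and the index-as-symbol convention are just choices for this example.
import heapq

freqs = [1, 1, 2, 3, 4, 8, 12, 21]

# Heap entries are (frequency, tiebreak id, tree); the id keeps heapq
# from ever comparing two trees directly.
heap = [(f, i, ('leaf', i)) for i, f in enumerate(freqs)]
heapq.heapify(heap)
next_id = len(freqs)

while len(heap) > 1:
    f1, _, left = heapq.heappop(heap)    # two lowest-frequency trees
    f2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, next_id, ('node', left, right)))
    next_id += 1

def codes(tree, prefix=''):
    if tree[0] == 'leaf':
        yield tree[1], prefix
    else:
        yield from codes(tree[1], prefix + '0')
        yield from codes(tree[2], prefix + '1')

for symbol, code in sorted(codes(heap[0][2])):
    print(f'symbol {symbol}  freq {freqs[symbol]:>2}  code {code}')
Rare symbols end up deep in the tree (long codes) and frequent ones near the root (short codes), which is exactly what makes the code optimal for these frequencies.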

Related

Ways to get combination from set of sets

I have a set of sets with possible repeats across sets, say {{1, 2, 3, 7}, {1, 2, 4, 5}, {1, 3, 5, 6}, {4, 5, 6}}. I want to know if I can get a specific combination, say {1, 2, 3, 4} by choosing one element from each set, and if so, I want to know how many ways I can do this.
I can do this by brute force (finding all ways to get the first element, then all ways to get the second element, and so on), but this seems rather painful and inefficient. Is there a better way to go about this?
You can reduce your problem to maximum bipartite matching (it is actually equivalent).
On the left side you have the elements of your target combination. On the right side you have your sets. You connect an element on the left with a set on the right iff the element is contained in the set.
Now you can apply an algorithm like Hopcroft-Karp https://de.wikipedia.org/wiki/Algorithmus_von_Hopcroft_und_Karp to find a maximum matching. If it is as big as your target combination, you have an assignment as you requested; otherwise not.
Counting the number of such matchings, however, is NP-hard; see https://www.sciencedirect.com/science/article/pii/S0012365X03002048: "the enumeration problem for perfect matchings in general graphs (even in bipartite graphs) is NP-hard".
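For an instance this small, even a simple augmenting-path matcher (Kuhn's algorithm) decides existence; Hopcroft-Karp is the asymptotically faster choice. Below is a minimal Python sketch of the reduction for the example instance; the function name try_assign and the match bookkeeping are my own illustrative choices.
sets = [{1, 2, 3, 7}, {1, 2, 4, 5}, {1, 3, 5, 6}, {4, 5, 6}]
target = [1, 2, 3, 4]
match = {}  # set index -> element currently assigned to that set

def try_assign(elem, visited):
    # Try to give 'elem' a set, evicting and re-homing a previous
    # assignment along an augmenting path if necessary.
    for i, s in enumerate(sets):
        if elem in s and i not in visited:
            visited.add(i)
            if i not in match or try_assign(match[i], visited):
                match[i] = elem
                return True
    return False

possible = all(try_assign(e, set()) for e in target)
print(possible, match)   # True {0: 3, 1: 2, 2: 1, 3: 4}
If the matching saturates every target element, the combination is achievable; counting all the ways remains hard in general, though backtracking works for small inputs.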

How to cluster values based on their frequency of occurrence?

I am working on a clustering algorithm where I need to cluster values based on their frequency in the data. This would indicate which values are not important; those would be treated as part of a larger cluster rather than as individual entities.
I am new to data science and would like to know the best algorithm/approach to achieve this.
For example, I have the following data set. The first list contains the property values and the second list their frequencies of occurrence.
Value = [1, 1.5, 2, 3, 4, 6, 8, 16, 32, 128]
Frequency = [207, 19, 169, 92, 36, 7, 12, 5, 2, 2]
Here, Frequency[i] corresponds to Value[i]
The frequency can be thought of as the importance of a value. The other thing that indicates the importance of a value is the distance between the elements in the array. For example, 1.5 is not that significant compared to 32 or 128, since it has much closer neighbours, such as 1 and 2.
When clustering these values, I need to look at the distances between values as well as the frequency of their occurrence. A possible output for the above problem would be:
Clust_value = [(1, 1.5), 2, 3, 4, (6, 8), 16, (32, 128)]
This is not the best clustering, but it is one possible answer. I need to know the best algorithm to approach this problem.
Firstly, I tried to solve this problem without taking into account the spread of elements in the values array, but that gave wrong answers in some situations. We also tried using the mean and median for clustering the values, again without a successful outcome.
We tried comparing the frequencies of neighbours and then clubbing the values into one cluster. We also tried finding the minimum distance between elements of the values array and putting them into one cluster based on a threshold on their difference, but this failed to cluster values with low frequencies. I also looked for clustering algorithms online but did not find any resource relevant to the problem defined above.
Is there any better way to approach the problem?
You need to come up with some mathematical quality criterion for what makes one solution better than another. Unless you have thousands of numbers, you can afford a rather 'brute force' method: begin with the first number, add the next as long as your quality increases, and otherwise begin a new cluster. Because your data are sorted, this will be fairly efficient and will find a rather good solution (you can try additional splits to further improve quality).
So it all boils down to you needing to specify quality.
Do not assume that existing criteria (e.g. variance in k-means) work for you. At most, you may be able to find a data transformation such that your requirements turn into variance, but that also will be specific to your problem.
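To make the greedy scan concrete, here is a short Python sketch on the data from the question. The quality criterion used here, merge the next value when (gap to the previous value) times (its frequency) stays under a fixed threshold, is purely illustrative (my assumption, not a recommendation); designing that criterion is exactly the part you have to do yourself.
values = [1, 1.5, 2, 3, 4, 6, 8, 16, 32, 128]
freqs  = [207, 19, 169, 92, 36, 7, 12, 5, 2, 2]
THRESHOLD = 20.0   # tuning knob (assumed); higher gives fewer, wider clusters

clusters = [[values[0]]]
for prev, (v, f) in zip(values, zip(values[1:], freqs[1:])):
    # Cheap to absorb a value that is close to its neighbour and/or rare.
    if (v - prev) * f < THRESHOLD:
        clusters[-1].append(v)
    else:
        clusters.append([v])

print(clusters)   # [[1, 1.5], [2], [3], [4, 6], [8], [16], [32], [128]]
The grouping is close to, but not identical with, the example in the question, which shows how sensitive the result is to the chosen criterion.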

Sorting Algorithm : output

I faced this problem on a website and I can't quite understand the output. Please help me understand it:
Bogosort is a naive algorithm that shuffles the sequence randomly until it is sorted. But here we have tweaked it a little: if after the last shuffle several first elements end up in the right places, we fix them and don't shuffle those elements any further. We do the same for the last elements if they are in the right places. For example, if the initial sequence is (3, 5, 1, 6, 4, 2) and after one shuffle we get (1, 2, 5, 4, 3, 6) we keep 1, 2 and 6 and proceed with sorting (5, 4, 3) using the same algorithm. Calculate the expected number of shuffles for the improved algorithm to sort the sequence of the first n natural numbers, given that no elements are in the right places initially.
Input:
2
6
10
Output:
2
1826/189
877318/35343
For each test case, output the expected number of shuffles needed for the improved algorithm to sort the sequence of the first n natural numbers, in the form of an irreducible fraction. I just can't understand the output.
I assume you found the problem on CodeChef. There is an explanation of the answer to the Bogosort problem here.
OK, I think I found the answer. There is a similar problem here: https://math.stackexchange.com/questions/20658/expected-number-of-shuffles-to-sort-the-cards/21273 , and this problem can be thought of as an extension of it.
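To see where the fractions come from: once the algorithm is about to shuffle a segment of size m, the expected number of remaining shuffles E(m) depends only on m, because the next shuffle is uniform whatever the segment currently looks like. The Python sketch below (my own reconstruction of the process, so treat it as a hypothesis) enumerates all m! outcomes of a shuffle, counts the size of the middle left after fixing the correct prefix and suffix, and solves E(m) = 1 + (1/m!) * sum_k count(k) * E(k) with exact fractions. It gives 2 for n = 2 and 1826/189 for n = 6, matching the sample output.
from fractions import Fraction
from itertools import permutations
from math import factorial

def expected_shuffles(n):
    # E[m]: expected shuffles once a segment of size m must be shuffled.
    # Segments of size 0 or 1 are already done (a lone unfixed element
    # would be forced into place, so size 1 never actually occurs).
    E = [Fraction(0)] * (n + 1)
    for m in range(2, n + 1):
        counts = [0] * (m + 1)  # counts[k]: shuffles leaving a middle of size k
        for p in permutations(range(m)):
            i = 0
            while i < m and p[i] == i:      # fix the correct prefix
                i += 1
            j = m - 1
            while j >= i and p[j] == j:     # fix the correct suffix
                j -= 1
            counts[j - i + 1 if j >= i else 0] += 1
        # E[m] = 1 + sum_k counts[k]/m! * E[k]; move the k = m term left.
        rhs = 1 + sum(Fraction(counts[k], factorial(m)) * E[k] for k in range(m))
        E[m] = rhs / (1 - Fraction(counts[m], factorial(m)))
    return E[n]

for n in (2, 6):
    print(n, expected_shuffles(n))   # 2 -> 2, 6 -> 1826/189
# n = 10 also works but enumerates 10! shuffles, so it is slow in Python.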

Exhaustive random number generator

In one of my projects I encountered the need to generate a set of numbers in a given range that will be:
1. Exhaustive, which means that it will cover most of the given range without any repetition.
2. Deterministic (every time the sequence will be the same). This can probably be achieved with a fixed seed.
3. Random (I am not very versed in random number theory, but I guess there is a bunch of rules that describe randomness; from that perspective, something like 0, 1, 2, ..., N is not random).
Ranges I am talking about can be ranges of integers, or of real numbers.
For example, if I use the standard C# random generator to generate 10 numbers in the range [0, 9] I get this:
0 0 1 2 0 1 5 6 2 6
As you can see, a big part of the given range remains 'unexplored' and there are many repetitions.
Of course, input space can be very large, so remembering previously chosen values is not an option.
What would be the right way to tackle this problem?
Thanks.
After the comments:
OK, I agree that random is not the right word, but I hope you understand what I am trying to achieve. I want to explore a given range that can be big, so an in-memory list is not an option. If the range is (0, 10) and I want three numbers, I want to guarantee that those numbers will be different and that they will 'describe the range' (i.e. they won't all be in the lower half, etc.).
The determinism part means that I would like to use something like a standard RNG with a fixed seed, so I can fully control the sequence.
I hope I made things a bit clearer.
Thanks.
Here are three options with different tradeoffs:
Generate a list of numbers ahead of time and shuffle them using the Fisher-Yates shuffle. Select from the list as needed. O(n) total memory and O(1) time per element. Randomness is as good as the PRNG you used to do the shuffle. The simplest of the three alternatives, too.
Use a linear feedback shift register (LFSR), which will generate every value in its sequence exactly once before repeating (see the sketch after this list). O(log n) total memory and O(1) time per element. It is easy to determine future values from the present value, however, and LFSRs are most easily constructed with periods of the form 2^k - 1 (but you can pick the next biggest such period and skip any out-of-range values).
Use a secure permutation based on a block cipher. Usable for any power of 2 period, and with a little extra trickery, any arbitrary period. O(log n) total space and O(1) time per element, randomness is as good as the block cipher. The most complex of the three to implement.
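Here is a minimal sketch of the LFSR option, assuming one classic parameter choice: a 16-bit Galois LFSR with tap mask 0xB400 has the maximal period 2^16 - 1, so it visits every nonzero 16-bit state exactly once before repeating; out-of-range values are skipped as described. Ranges larger than 2^16 - 1 would need a wider register.
def lfsr_sequence(limit, seed=0xACE1):
    state = seed                 # nonzero fixed seed => deterministic sequence
    while True:
        state = (state >> 1) ^ (-(state & 1) & 0xB400)   # Galois LFSR step
        value = state - 1        # map 1..65535 onto 0..65534 (state 0 never occurs)
        if value < limit:
            yield value          # skip anything outside [0, limit)
        if state == seed:        # full period traversed; every state seen once
            return

values = list(lfsr_sequence(1000))
print(len(values), len(set(values)))   # 1000 1000 -- exhaustive, no repeats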
If you just need something simple, what about this?
maxint = 16
step = 7
sequence = 7, 14, 5, 12, 3, 10, 1, 8, 15, 6, 13, 4, 11, 2, 9, 0
If you pick step so that it is coprime to maxint (here gcd(7, 16) = 1), it will generate the entire interval before repeating. You can play around with different values of step to get something that "looks" good. The "seed" here is where you start in the sequence.
Is this random? Of course not. Will it look random according to a statistical test of randomness? It might depend on the step, but likely this will not look very statistically random at all. However, it certainly picks the numbers in the range, not in their original order, and without any memory of the numbers picked so far.
In fact, you could make this look even better by making a list of factors - like [1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15, 16] - and using shuffled versions of those to compute step * factor (mod maxint). Say we shuffled the example factor lists into [3, 2, 4, 5, 1], [6, 8, 9, 10, 7], [13, 16, 12, 11, 14, 15]; then we'd get the sequence
5, 14, 12, 3, 7, 10, 8, 15, 6, 1, 11, 0, 4, 13, 2, 9
The size of the factor lists is completely tunable, so you can spend as much memory as you like: bigger factor lists, more randomness. There are no repeats regardless of the factor list size, and when you exhaust a factor list, generating a new one is as easy as counting and shuffling.
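Here is a Python sketch of that scheme (the function name and parameters are mine): the only requirement is that step is coprime to maxint, and the fixed seed makes the factor-block shuffles, and hence the whole sequence, deterministic.
import random

def coprime_walk(maxint=16, step=7, block=5, seed=42):
    # Requires gcd(step, maxint) == 1 so f -> step*f mod maxint is a bijection.
    rng = random.Random(seed)   # fixed seed => identical sequence every run
    factors = list(range(1, maxint + 1))
    for start in range(0, maxint, block):
        chunk = factors[start:start + block]
        rng.shuffle(chunk)      # shuffle one block of factors at a time
        for f in chunk:
            yield (step * f) % maxint

print(list(coprime_walk()))    # 16 distinct values covering 0..15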
It is my impression that what you are looking for is a randomly ordered list of numbers, not a random list of numbers. You should be able to get this with the following code (Python). Better mathematicians may be able to tell me if this is in fact not random:
import random

lst = list(range(1, 101))
for index in range(len(lst)):
    # pick a random position in the rest of the list (index included)
    location = random.randrange(len(lst) - index)
    lst[index], lst[location + index] = lst[location + index], lst[index]
Basically, go through the list and pick a random item from the rest of the list to use at the position you are at. This should randomly arrange the items in your list. If you need to reproduce the same random order each time, consider saving the array, or seeding the random generator so it returns the same numbers in the same order on every run.
Generate an array that contains the range, in order. So the array contains [0, 1, 2, 3, 4, 5, ... N]. Then use a Fisher-Yates Shuffle to scramble the array. You can then iterate over the array to get your random numbers.
If you need repeatability, seed your random number generator with the same value at the start of the shuffle.
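A minimal demonstration of this answer, using Python for illustration: random.shuffle is a Fisher-Yates shuffle under the hood, and a Random instance seeded with a fixed value makes the order reproducible.
import random

def deterministic_permutation(n, seed=1234):
    arr = list(range(n))                 # the range, in order
    random.Random(seed).shuffle(arr)     # same seed -> same permutation every run
    return arr

print(deterministic_permutation(10))
print(deterministic_permutation(10))    # identical to the line above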
Do not use a random number generator to select numbers in a range. What will eventually happen is that you have one number left to fill, and your random number generator will cycle repeatedly until it selects that number. Depending on the random number generator, there is no guarantee that will ever happen.
What you should do is generate a list of numbers in the desired range, then use a random number generator to shuffle the list. The shuffle is known as the Fisher-Yates shuffle, or sometimes the Knuth shuffle. Here it is in Python, shuffling an array x of n elements with indices from 0 to n-1 in place:
import random

def knuth_shuffle(x):
    for i in range(len(x) - 1, 0, -1):
        j = random.randint(0, i)    # random integer such that 0 <= j <= i
        x[i], x[j] = x[j], x[i]     # swap x[i] and x[j]

Bogosort optimization, probability related

I'm coding a question on an online judge for practice. The question is about optimizing Bogosort and involves not shuffling the entire number range every time: if after the last shuffle several first elements end up in the right places, we fix them and don't shuffle those elements any further. We do the same for the last elements if they are in the right places. For example, if the initial sequence is (3, 5, 1, 6, 4, 2) and after one shuffle Johnny gets (1, 2, 5, 4, 3, 6) he fixes 1, 2 and 6 and proceeds with sorting (5, 4, 3) using the same algorithm.
For each test case, output the expected number of shuffles needed for the improved algorithm to sort the sequence of the first n natural numbers, in the form of an irreducible fraction.
A sample input/output says that for n=6, the answer is 1826/189.
I don't quite understand how the answer was arrived at.
This looks similar to 2011 Google Code Jam, Preliminary Round, Problem 4; however, the answer there is n, and I don't know how you get 1826/189.
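A quick way to check any derivation (it cross-checks the exact recurrence sketched in the earlier Bogosort question) is to simulate the tweaked process directly. This Monte Carlo sketch is my own; the rejection-sampled derangement matches the "no elements in the right places initially" condition, and it should land near 1826/189 ≈ 9.66 for n = 6.
import random

def simulate(n, trials=100_000, seed=1):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seq = list(range(n))
        while True:                      # rejection-sample a derangement:
            rng.shuffle(seq)             # no element starts in its place
            if all(seq[i] != i for i in range(n)):
                break
        lo, hi = 0, n - 1                # unfixed window [lo, hi]
        while lo < hi:
            mid = seq[lo:hi + 1]
            rng.shuffle(mid)             # one shuffle of the unfixed part
            seq[lo:hi + 1] = mid
            total += 1
            while lo <= hi and seq[lo] == lo:   # fix the correct prefix...
                lo += 1
            while hi >= lo and seq[hi] == hi:   # ...and the correct suffix
                hi -= 1
    return total / trials

print(simulate(6))   # should be close to 1826/189 = 9.6614...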
