insert same values into binomial heap - algorithm

I have to insert some values into a binomial heap, for example 25, 26, 24, 60, 65, 62.
But then I have to insert 25, 68, 65 into the same heap. Should I insert 25 again, or just skip it since it is already present in the heap?

It is up to your implementation and your specific requirements. Do you need duplicate elements? A binomial heap can support inserting the same value multiple times and will perform just as well (if you implement it correctly), but that does not mean it should in your case.
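For illustration, here is a minimal min-ordered binomial heap insert in Python. It is only a sketch (not your implementation, and not production code), but it shows that a duplicate key is simply linked into the heap like any other node:

class BinomialNode:
    def __init__(self, key):
        self.key = key
        self.degree = 0
        self.children = []

def link(a, b):
    # Link two trees of equal degree; the smaller key becomes the root (min-heap order).
    if a.key > b.key:
        a, b = b, a
    a.children.append(b)
    a.degree += 1
    return a

def insert(heap, key):
    # heap is a list of binomial trees with distinct degrees, sorted by degree.
    # Inserting merges a one-node heap in; equal degrees are linked,
    # like carrying in binary addition.
    carry = BinomialNode(key)
    result = []
    for tree in heap:
        if carry is not None and carry.degree == tree.degree:
            carry = link(carry, tree)
        else:
            if carry is not None and carry.degree < tree.degree:
                result.append(carry)
                carry = None
            result.append(tree)
    if carry is not None:
        result.append(carry)
    return result

heap = []
for key in [25, 26, 24, 60, 65, 62, 25, 68, 65]:   # duplicates become separate nodes
    heap = insert(heap, key)

Whether you want that behaviour, or would rather skip the duplicate, is a question about your requirements rather than about the data structure.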


Algorithm Complexity: Time and Space

I have two solutions to one coding problem. Both of the solutions work fine. I just need to find the time/space complexity for both of them.
Question: Given a string of words, return the largest set of unique words in the given string that are anagrams of each other.
Example:
Input: 'I am bored and robed, nad derob'
Correct output: {bored, robed, derob}
Wrong output: {and, nad}
Solution 1:
In the first solution, I iterate over the words of the given string, sort the characters of each word, and use the sorted result as a dictionary key; the original (unsorted) word is added to the set of words stored as the value for that key. Sorting is what groups the words that are anagrams of each other. At each iteration, I also keep track of the key whose set of words is the longest, and at the end I return the set stored under that key.
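A rough Python sketch of Solution 1 as described (the names here are mine, not the poster's code):

from collections import defaultdict

def solution_one(string):
    d = defaultdict(set)
    longest_key = None
    for word in string.replace(",", "").replace(".", "").split():
        key = "".join(sorted(word.lower()))          # anagrams share the same sorted key
        d[key].add(word)
        if longest_key is None or len(d[key]) > len(d[longest_key]):
            longest_key = key
    return d[longest_key] if longest_key is not None else set()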
Solution 2:
In the second solution, I do almost the same thing. The only difference is that, instead of sorting the characters of each word, I map every character to a prime number and use the product of those primes as the key in my dictionary.
from collections import defaultdict

def solution_two(string):
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
              53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
    d = defaultdict(set)
    longest_set = ''
    for word in string.replace(",", "").replace(".", "").split():
        key = 1
        for ch in word:
            key *= primes[ord(ch.lower()) - 97]
        d[key].add(word)
        if len(d[key]) > len(d[longest_set]):
            longest_set = key
    return d[longest_set]
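For reference, running it on the example from the question should return the expected group of anagrams:

print(solution_two('I am bored and robed, nad derob'))
# {'bored', 'robed', 'derob'}   (set order may vary)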
My thoughts:
I think that the runtime of the first solution is O(n), where n is the number of words in the string. But if I sort each word of the string, wouldn't it make the runtime O(n) * O(n log n)?
As for the second one, I think that it has a linear runtime too. But I have a second loop inside the first one where I iterate through each character of a word...
I am also confused about the space complexity for both of the solutions.
Any guidance would be greatly appreciated. Thank you.
Just to recap :)
ALG1: time is O(n) * O(awl log(awl)) and space is O(n) * O(awl), where awl is the average word length. Be careful: this looks pretty good, but consider that if awl is much smaller than n you get roughly O(n) time, while if awl is larger the O(awl log(awl)) term dominates; the worst case of all is when n = awl.
ALG2: time is O(n) * O(awl), where O(n) is for iterating over the words of the given string and O(awl) is for calculating the key of each word (i.e. take a word, look up a prime for each character in the list primes, and multiply them), and space is O(n).
The same considerations on n and awl apply as for the previous algorithm.
So if I am correct, your second algorithm in the worst case has a complexity of n², better than the first one, and it is better in space as well!

In what scenarios hash partitioning is preferred over range partitioning in Spark?

I have gone through various articles about hash partitioning, but I still don't get in what scenarios it is more advantageous than range partitioning. Using sortByKey followed by range partitioning allows data to be distributed evenly across the cluster. But that may not be the case with hash partitioning. Consider the following example:
A pair RDD has keys [8, 96, 240, 400, 401, 800] and the desired number of partitions is 4.
In this case, hash partitioning distributes the keys as follows among the
partitions:
partition 0: [8, 96, 240, 400, 800]
partition 1: [ 401 ]
partition 2: []
partition 3: []
(To compute the partition: p = key.hashCode() % numPartitions)
The above partitioning leads to bad performance because the keys are not evenly distributed across all nodes. Since range partitioning can distribute the keys equally across the cluster, in what scenarios does hash partitioning prove to be a better fit than range partitioning?
While the weakness of hashCode is of some concern, especially when working with small integers, it can usually be addressed by adjusting the number of partitions based on domain-specific knowledge. It is also possible to replace the default HashPartitioner with a custom Partitioner that uses a more appropriate hashing function. As long as there is no data skew, hash partitioning behaves well enough at scale on average.
Data skew is a completely different problem. If the key distribution is significantly skewed, then the distribution of the partitioned data is likely to be skewed no matter what Partitioner is used. Consider for example the following RDD:
sc.range(0, 1000).map(i => if(i < 900) 1 else i).map((_, None))
which simply cannot be uniformly partitioned.
Why not use RangePartitioner by default?
It is less general than HashPartitioner. While HashPartitioner requires only a proper implementation of ## and == for K, RangePartitioner requires an Ordering[K].
Unlike HashPartitioner, it has to approximate the data distribution, and therefore requires an additional data scan.
Because splits are computed based on a particular distribution, it might be unstable when reused across datasets. Consider the following example:
val rdd1 = sc.range(0, 1000).map((_, None))
val rdd2 = sc.range(1000, 2000).map((_, None))
val rangePartitioner = new RangePartitioner(11, rdd1)
rdd1.partitionBy(rangePartitioner).glom.map(_.length).collect
Array[Int] = Array(88, 91, 99, 91, 87, 92, 83, 93, 91, 86, 99)
rdd2.partitionBy(rangePartitioner).glom.map(_.length).collect
Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1000)
As you can imagine this has serious implications for operations like joins. At the same time
val hashPartitioner = new HashPartitioner(11)
rdd1.partitionBy(hashPartitioner).glom.map(_.length).collect
Array[Int] = Array(91, 91, 91, 91, 91, 91, 91, 91, 91, 91, 90)
rdd2.partitionBy(hashPartitioner).glom.map(_.length).collect
Array[Int] = Array(91, 91, 91, 91, 91, 91, 91, 91, 91, 90, 91)
This brings us back to your questions:
in what scenarios it is more advantageous than range partitioning.
Hash partitioning is the default approach in many systems because it is relatively agnostic, usually behaves reasonably well, and doesn't require additional information about the data distribution. These properties make it preferable in the absence of any a priori knowledge about the data.

Sorting unspecified amount of numbers in Lua

I need to sort an unspecified amount of numbers in Lua. For example, if I have these numbers 15, 21, 31, 50, 32, 11, 11, I need Lua to sort them so the first one is the biggest, like this: 50, 32, 31, 21, 15, 11, 11.
What is the easiest way to do this? Remember, it has to work with an unspecified amount of numbers. Thanks!
table.sort sorts a table in place. By default, it uses < to compare elements. To sort them with the bigger elements before the smaller ones:
local t = {15, 21, 31, 50, 32, 11, 11}
table.sort(t, function(a, b) return a > b end)
The number of elements doesn't matter, as a table can hold as many elements as memory allows.

Why shouldn't radix sort be implemented starting with the MSD?

I'm reading Introduction to Algorithms by Cormen et al., and in the part where they describe radix sort, they say:
Intuitively you might sort numbers on their most significant digit,
sort each of the resulting bins recursively and then combine the decks
in order. Unfortunately since the cards in 9 of the 10 bins must be
put aside to sort each of the bins, this procedure generates many
intermediate piles of cards that you'd have to keep track of.
What does this mean?
I don't understand why sorting by the MSD would be a problem.
Consider the following example list to sort:
170, 045, 075, 090, 002, 024, 802, 066, 182, 332, 140, 144
Sorting by most significant digit (hundreds) gives:
Zero hundreds bucket: 045, 075, 090, 002, 024, 066
One hundreds bucket: 170, 182, 140, 144
Three hundreds bucket: 332
Eight hundreds bucket: 802
Sorting by the next digit (tens) is now needed for the numbers in the zero and one hundreds buckets (the other two buckets only contain one item each):
Zero tens: 002
Twenties: 024
Forties: 045
Sixties: 066
Seventies: 075
Nineties: 090
Sorting by least significant digit (1s place) is not needed, as there is no tens bucket with more than one number. That's not the case with the one hundreds bucket though (exercise: recursively sort it yourself). Therefore, the now sorted zero hundreds bucket is concatenated, joined in sequence, with the one, three and eight hundreds bucket to give:
002, 024, 045, 066, 075, 090, 140, 144, 170, 182, 332, 802
You can see that the authors are referring to the intermediate recursive sorting steps, which are not necessary in an LSD radix sort.
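A small recursive MSD sketch in Python (my own illustration, not from the book) makes those intermediate piles visible: every non-trivial bucket has to be sorted on its own before the buckets can be concatenated.

def msd_radix_sort(nums, exp):
    # exp is the place value of the current digit (100 for the example above).
    # Each bucket is an "intermediate pile" that must be sorted recursively
    # before the piles are joined back together.
    if exp == 0 or len(nums) <= 1:
        return nums
    buckets = [[] for _ in range(10)]
    for n in nums:
        buckets[(n // exp) % 10].append(n)
    return [x for b in buckets for x in msd_radix_sort(b, exp // 10)]

print(msd_radix_sort([170, 45, 75, 90, 2, 24, 802, 66, 182, 332, 140, 144], 100))
# [2, 24, 45, 66, 75, 90, 140, 144, 170, 182, 332, 802]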
The authors are alluding to a useful property of an LSD radix sort: since you ensure each sorting step is stable, you only have to run one step for each digit, over the whole array, and you never have to sort any subsets individually.
To take Michael's example data:
After 0 steps:
170, 045, 075, 090, 002, 024, 802, 066, 182, 332, 140, 144
After 1 step (sort on units):
170, 090, 140, 002, 802, 182, 332, 024, 144, 045, 075, 066
After 2 steps (sort on tens):
002, 802, 024, 332, 140, 144, 045, 066, 170, 075, 182, 090
After 3 steps (sort on hundreds):
002, 024, 045, 066, 075, 090, 140, 144, 170, 182, 332, 802
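An LSD version, sketched the same way (again my own illustration), needs no recursion and no per-bucket bookkeeping; each pass is one stable bucket sort over the whole array:

def lsd_radix_sort(nums, base=10):
    exp = 1
    while max(nums) // exp > 0:
        buckets = [[] for _ in range(base)]
        for n in nums:                               # appending keeps each pass stable
            buckets[(n // exp) % base].append(n)
        nums = [n for bucket in buckets for n in bucket]
        exp *= base
    return nums

print(lsd_radix_sort([170, 45, 75, 90, 2, 24, 802, 66, 182, 332, 140, 144]))
# [2, 24, 45, 66, 75, 90, 140, 144, 170, 182, 332, 802]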
This property becomes especially useful if you're radix-sorting in binary rather than base 10. Then each sorting step is just a partition into two, which is very simple. At least, it is until you want to do it without using any extra memory.
MSD radix sort works, of course; it just requires more bookkeeping and/or non-tail recursion. It's only a "problem" in that CLRS (in common with other expert programmers) don't like to do fiddly work until it's necessary.

Getting indices of sorting integer array

I have an array of integers which I need to sort. However, the result should not contain the integer values but the indices, i.e. the new order of the old array.
For example: [10, 20, 30]
should result in: [2, 1, 0]
What is an optimized algorithm to achieve this?
You can achieve this with any sorting algorithm, if you convert each element to a tuple of (value, position) and sort this.
That is, [10, 20, 30] would become [(10, 0), (20, 1), (30, 2)]. You'd then sort this array using a comparator that looks at the first element of the tuples, giving you [(30, 2), (20, 1), (10, 0)]. From this, you can simply grab the second element of each tuple to get what you want, [2, 1, 0]. (Under the assumption you want reverse sorting.)
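A minimal Python sketch of the tuple approach (descending, to match the example in the question):

values = [10, 20, 30]
pairs = sorted(((v, i) for i, v in enumerate(values)), reverse=True)   # (value, index) tuples, biggest value first
indices = [i for v, i in pairs]
print(indices)   # [2, 1, 0]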
This won't be different from any other sorting algorithm: just modify it so that it builds or takes in an array of indices and then manipulates both the data and the array of indices instead of just the data.
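For instance, a short Python variation of this idea (here only the index array is reordered, and the data is consulted for comparisons):

values = [10, 20, 30]
indices = list(range(len(values)))                      # start with 0..n-1
indices.sort(key=lambda i: values[i], reverse=True)     # compare by the data, reorder the indices
print(indices)   # [2, 1, 0]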
You could create an array of pointers into the original array of integers, perform a merge sort or whatever sorting algorithm you find most suitable (comparing the values the pointers point at), then run down the list calculating the indices from each pointer's offset relative to the beginning of the allocated block containing the original array of integers.
