Binary search tree

This is my homework. I have thought a lot about it, but I could not get the answer. I need your guidance, please. Thanks.
Q:
We have keys from 1 to 1000 in a BST and we want to find the key 363.
Which of the following search sequences is not correct?
<925, 202, 911, 240, 912, 245, 363>
<924, 220, 911, 244, 898, 258, 362, 363>

Hint: When searching in a sorted BST, the upper and lower bounds should only get tighter.

<925, 202, 911, 240, 912, 245, 363>
Doesn't make sense
From 911, you take the smaller branch to 240, yet you then somehow arrive at 912. This should be impossible.
It is not just the left child of a node that is smaller than that node: ALL elements in its left subtree must be smaller. Since 912 > 911, it cannot appear anywhere in 911's left subtree, so it is in the wrong subtree.
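The hint can be turned into a small checker: carry the (low, high) bounds implied by each turn down the tree and require every visited key to stay inside them. This is a minimal sketch; is_valid_search_path is a hypothetical helper, not something from the question.

def is_valid_search_path(path, target, low=float("-inf"), high=float("inf")):
    # Return True if `path` could be the sequence of keys visited while
    # searching a BST for `target`: every key must respect the bounds
    # accumulated so far, and the bounds only get tighter.
    for key in path:
        if not (low < key < high):
            return False              # key fell outside the feasible range
        if key == target:
            return True
        if target < key:
            high = key                # went left: later keys must be < key
        else:
            low = key                 # went right: later keys must be > key
    return False

print(is_valid_search_path([925, 202, 911, 240, 912, 245, 363], 363))      # False: 912 violates the bound set at 911
print(is_valid_search_path([924, 220, 911, 244, 898, 258, 362, 363], 363)) # True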

Related

Algorithm Complexity: Time and Space

I have two solutions to one coding problem. Both of the solutions work fine. I just need to find the time/space complexity for both of them.
Question: Given a string of words, return the largest set of unique words in the given string that are anagrams of each other.
Example:
Input: 'I am bored and robed, nad derob'
Correct output: {bored, robed, derob}
Wrong output: {and, nad}
Solution 1:
In the first solution, I iterate over the words of the given string, sort the characters of each word, and use the sorted result as a key in a dictionary. The original (unsorted) word is added to a set of words stored as the value for that key; sorting is what groups words that are anagrams of each other. At each iteration I also keep track of the key that has the longest set of words as its value, and at the end I return that key's set.
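Solution 1 is described only in prose, so here is a minimal sketch of what it might look like; the function name solution_one and the punctuation stripping are assumptions that mirror solution_two below.

from collections import defaultdict

def solution_one(string):
    # Group words by their sorted-character signature; anagrams share a key.
    d = defaultdict(set)
    longest_key = None
    for word in string.replace(",", "").replace(".", "").split():
        key = "".join(sorted(word.lower()))
        d[key].add(word)
        if longest_key is None or len(d[key]) > len(d[longest_key]):
            longest_key = key
    return d[longest_key] if longest_key is not None else set()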
Solution 2:
In the second solution, I do almost the same thing. The only difference is that instead of sorting the characters of each word, I map every character to a prime number and use the product of those primes as the dictionary key.
from collections import defaultdict

def solution_two(string):
    # One prime per letter 'a'..'z'; anagrams end up with the same product.
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
              53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
    d = defaultdict(set)
    longest_set = ''
    for word in string.replace(",", "").replace(".", "").split():
        key = 1
        for ch in word:
            key *= primes[ord(ch.lower()) - 97]
        d[key].add(word)
        if len(d[key]) > len(d[longest_set]):
            longest_set = key
    return d[longest_set]
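For reference, running solution_two on the example from the question should print the expected group of anagrams:

print(solution_two('I am bored and robed, nad derob'))
# {'bored', 'robed', 'derob'} (set ordering may vary)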
My thoughts:
I think that the runtime of the first solution is O(n), where n is the number of words in the string. But if I sort each word of the string, wouldn't it make the runtime O(n) * O(n log n)?
As for the second one, I think it has linear runtime too. But I have a second loop inside the first one, where I iterate through each character of a word...
I am also confused about the space complexity for both of the solutions.
Any guidance would be greatly appreciated. Thank you.
ALG1: time is O(n) * O(awl * log(awl)) and space is O(n) * O(awl), where n is the number of words and awl is the average word length. Be careful, though: if awl is much smaller than n you get roughly O(n) time, while if awl is bigger the O(awl * log(awl)) factor dominates; the worst case of all is when n = awl.
ALG2: time is O(n) * O(awl), where O(n) is for iterating over the given string and O(awl) is for computing the prime product for each word (i.e. take a word, look up a prime for each character in the list primes, and multiply them), and space is O(n).
The same considerations on n and awl apply as for the previous algorithm.
So, if I am correct, in the worst case your second algorithm has a complexity of n², better than the first one, and it also wins on space!
Just to recap :)

In what scenarios hash partitioning is preferred over range partitioning in Spark?

I have gone through various articles about hash partitioning, but I still don't see in what scenarios it is more advantageous than range partitioning. Using sortByKey followed by range partitioning allows data to be distributed evenly across the cluster, but that may not be the case with hash partitioning. Consider the following example:
Consider a pair RDD with keys [8, 96, 240, 400, 401, 800] and the desired number of partition is 4.
In this case, hash partitioning distributes the keys as follows among the
partitions:
partition 0: [8, 96, 240, 400, 800]
partition 1: [ 401 ]
partition 2: []
partition 3: []
(To compute the partition: p = key.hashCode() % numPartitions)
The above partitioning leads to bad performance because the keys are not evenly distributed across all nodes. Since range partitioning can distribute the keys evenly across the cluster, in what scenarios does hash partitioning prove to be a better fit than range partitioning?
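The distribution shown in the question follows directly from that rule; for integer keys the hash code is the value itself. The snippet below is a plain Python sketch of the arithmetic, not Spark code (Spark's HashPartitioner additionally maps negative hash codes to non-negative partition ids).

from collections import defaultdict

keys = [8, 96, 240, 400, 401, 800]
num_partitions = 4

partitions = defaultdict(list)
for key in keys:
    partitions[key % num_partitions].append(key)   # p = key.hashCode() % numPartitions

for p in range(num_partitions):
    print(p, partitions[p])
# 0 [8, 96, 240, 400, 800]
# 1 [401]
# 2 []
# 3 []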
While the weakness of hashCode is of some concern, especially when working with small integers, it can usually be addressed by adjusting the number of partitions based on domain-specific knowledge. It is also possible to replace the default HashPartitioner with a custom Partitioner that uses a more appropriate hashing function. As long as there is no data skew, hash partitioning behaves well enough at scale on average.
Data skew is a completely different problem. If the key distribution is significantly skewed, then the distribution of the partitioned data is likely to be skewed no matter which Partitioner is used. Consider for example the following RDD:
sc.range(0, 1000).map(i => if (i < 900) 1 else i).map((_, None))
which simply cannot be uniformly partitioned.
Why not use RangePartitioner by default?
It is less general than HashPartitioner. While HashPartitioner requires only a proper implementation of ## and == for K, RangePartitioner requires an Ordering[K].
Unlike HashPartitioner, it has to approximate the data distribution, and therefore requires an additional scan of the data.
Because splits are computed based on a particular distribution, it might be unstable when reused across datasets. Consider the following example:
val rdd1 = sc.range(0, 1000).map((_, None))
val rdd2 = sc.range(1000, 2000).map((_, None))
val rangePartitioner = new RangePartitioner(11, rdd1)
rdd1.partitionBy(rangePartitioner).glom.map(_.length).collect
Array[Int] = Array(88, 91, 99, 91, 87, 92, 83, 93, 91, 86, 99)
rdd2.partitionBy(rangePartitioner).glom.map(_.length).collect
Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1000)
As you can imagine this has serious implications for operations like joins. At the same time
val hashPartitioner = new HashPartitioner(11)
rdd1.partitionBy(hashPartitioner).glom.map(_.length).collect
Array[Int] = Array(91, 91, 91, 91, 91, 91, 91, 91, 91, 91, 90)
rdd2.partitionBy(hashPartitioner).glom.map(_.length).collect
Array[Int] = Array(91, 91, 91, 91, 91, 91, 91, 91, 91, 90, 91)
This brings us back to your question:
in what scenarios it is more advantageous than range partitioning.
Hash partitioning is a default approach in many systems because it is relatively agnostic, usually behaves reasonably well, and doesn't require additional information about the data distribution. These properties make it preferable in the absence of any a priori knowledge about the data.

Find the max weight subset which to carry

A person has items with below weights.
[10, 10, 12, 15, 16, 20, 45, 65, 120, 140, 150, 178, 198, 200, 210, 233, 298 , 306, 307, 310 , 375, 400, 420 , 411 , 501, 550, 662, 690 ,720, 731, 780, 790]
And the maximum weight he can carry home is 3 kg (3000 grams). He wants to carry as much as possible.
Note: I tried a backtracking algorithm, but it only gives me subsets whose sum equals the target exactly; in cases where no subset matches the sum exactly, it fails. I want to find the subset whose sum is closest to the target.
This is the subset sum problem, which is solvable with Dynamic Programming (essentially an efficient version of your backtracking approach) using the following recurrence:
D(0, n) = True
D(x, 0) = False, for x > 0
D(x, n) = D(x - arr[n], n-1)    (item n is in the subset)
          OR D(x, n-1)          (item n is not in the subset)
By using bottom-up Dynamic Programming (creating a table and filling it from lower to higher indices) or top-down Dynamic Programming (memoizing every result and checking whether it has already been calculated before recursing), this is solvable in O(n*W), where n is the number of elements and W is the target weight (3000 in your case).
If you run the bottom-up DP, the largest value of x such that D(x, n) = True is the maximum weight one can carry. To find the actual items, follow the table back, examine which item was added at each decision point, and yield the items that were taken. Returning the actual set is explained in more detail in the thread: How to find which elements are in the bag, using Knapsack Algorithm [and not only the bag's value]? (That thread deals with the knapsack problem, which is a variant of your problem with weight = cost for each item.)
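A minimal bottom-up sketch of this DP for the data in the question, using a one-dimensional reachability array instead of the full D(x, n) table (recovering the chosen items would require the full table or parent pointers, as described in the linked thread):

def max_carry(weights, capacity):
    # reachable[x] is True iff some subset of the weights seen so far sums to x
    reachable = [False] * (capacity + 1)
    reachable[0] = True                      # D(0, n) = True: the empty subset
    for w in weights:
        # Iterate downwards so every item is used at most once (0/1 subset sum).
        for x in range(capacity, w - 1, -1):
            if reachable[x - w]:
                reachable[x] = True
    # Largest reachable sum that does not exceed the capacity.
    return max(x for x in range(capacity + 1) if reachable[x])

weights = [10, 10, 12, 15, 16, 20, 45, 65, 120, 140, 150, 178, 198, 200, 210,
           233, 298, 306, 307, 310, 375, 400, 420, 411, 501, 550, 662, 690,
           720, 731, 780, 790]
print(max_carry(weights, 3000))  # 3000, e.g. 20 + 690 + 720 + 780 + 790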
Using backtracking, we can frame the solution like this:
We return the maximum subset weight that is nearest to, but not above, the given weight, using this pseudocode:
func(weight_till_now, curr_pos)
    if (weight_till_now > max_possible) return 0
    if (curr_pos >= N) return weight_till_now
    // Taking item curr_pos into the current subset
    weight = max(weight_till_now, func(weight_till_now + wt[curr_pos], curr_pos + 1))
    // Not taking item curr_pos
    weight = max(weight, func(weight_till_now, curr_pos + 1))
    return weight
Calling this function with the initial parameters (0, 0) gives you the answer: it tries each and every subset, keeps the maximum weight over all feasible subsets, and returns 0 for any branch whose running weight exceeds the maximum possible weight.
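A runnable version of that pseudocode might look like the sketch below. Unlike the DP above, it is exponential in the number of items, so it is only practical for small inputs.

def best_weight(weights, max_possible, weight_till_now=0, curr_pos=0):
    if weight_till_now > max_possible:
        return 0                               # this branch overshot the limit
    if curr_pos >= len(weights):
        return weight_till_now                 # no items left to consider
    # Take item curr_pos, or skip it, and keep the better outcome.
    take = best_weight(weights, max_possible,
                       weight_till_now + weights[curr_pos], curr_pos + 1)
    skip = best_weight(weights, max_possible, weight_till_now, curr_pos + 1)
    return max(take, skip)

print(best_weight([10, 20, 45, 65], 60))  # 55 (10 + 45)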

insert same values into binomial heap

I have to insert some values into a binomial heap, for example 25, 26, 24, 60, 65, 62. The heap will look as follows:
But then I have to insert 25, 68, 65 into the same heap. Should I insert 25 again, or just skip it since it is already present in the heap?
It is up to your implementation and specific requirements. Do you need duplicate elements? A binomial heap can support inserting the same value multiple times and will perform just as well (if implemented correctly), but that does not mean it should in your case.
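Python's standard library has no binomial heap, but the point about duplicates can be illustrated with heapq's binary heap (a different heap structure, used here only to show that repeated keys are simply stored again and popped in non-decreasing order):

import heapq

h = []
for value in [25, 26, 24, 60, 65, 62, 25, 68, 65]:
    heapq.heappush(h, value)       # duplicates are stored again, not skipped

print([heapq.heappop(h) for _ in range(len(h))])
# [24, 25, 25, 26, 60, 62, 65, 65, 68]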

Why shouldn't radix sort be implemented starting with the MSD?

I'm reading Introduction to Algorithms by Cormen et al., and in the part where they describe radix sort they say:
Intuitively, you might sort numbers on their most significant digit, sort each of the resulting bins recursively, and then combine the decks in order. Unfortunately, since the cards in 9 of the 10 bins must be put aside to sort each of the bins, this procedure generates many intermediate piles of cards that you'd have to keep track of.
What does this mean?
I don't understand why sorting by the MSD would be a problem.
Consider the following example list to sort:
170, 045, 075, 090, 002, 024, 802, 066, 182, 332, 140, 144
Sorting by most significant digit (hundreds) gives:
Zero hundreds bucket: 045, 075, 090, 002, 024, 066
One hundreds bucket: 170, 182, 140, 144
Three hundreds bucket: 332
Eight hundreds bucket: 802
Sorting by next digit is now needed for numbers in the zero and one hundreds bucket (the other two buckets only contain one item each):
Zero tens: 002
Twenties: 024
Forties: 045
Sixties: 066
Seventies: 075
Nineties: 090
Sorting by least significant digit (1s place) is not needed, as there is no tens bucket with more than one number. That's not the case with the one hundreds bucket though (exercise: recursively sort it yourself). Therefore, the now sorted zero hundreds bucket is concatenated, joined in sequence, with the one, three and eight hundreds bucket to give:
002, 024, 045, 066, 075, 090, 140, 144, 170, 182, 332, 802
You can see that the authors are referring to the intermediate recursive sorting steps, which are not necessary in an LSD radix sort.
They are referring to a useful property of LSD radix sort: since each sorting pass is stable, you only have to run one pass per digit over the whole array, and you never have to sort any subsets individually.
To take Michael's example data:
After 0 steps:
170, 045, 075, 090, 002, 024, 802, 066, 182, 332, 140, 144
After 1 step (sort on units):
170, 090, 140, 002, 802, 182, 332, 024, 144, 045, 075, 066
After 2 steps (sort on tens):
002, 802, 024, 332, 140, 144, 045, 066, 170, 075, 182, 090
After 3 steps (sort on hundreds):
002, 024, 045, 066, 075, 090, 140, 144, 170, 182, 332, 802
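A minimal LSD radix sort sketch that reproduces those three stable passes (digits taken from least to most significant):

def lsd_radix_sort(nums, digits=3, base=10):
    # One stable bucketing pass per digit, least significant digit first.
    for d in range(digits):
        buckets = [[] for _ in range(base)]
        for n in nums:
            buckets[(n // base**d) % base].append(n)
        nums = [n for bucket in buckets for n in bucket]
    return nums

data = [170, 45, 75, 90, 2, 24, 802, 66, 182, 332, 140, 144]
print(lsd_radix_sort(data))
# [2, 24, 45, 66, 75, 90, 140, 144, 170, 182, 332, 802]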
This property becomes especially useful if you're radix-sorting in binary rather than base 10. Then each sorting step is just a partition into two, which is very simple. At least, it is until you want to do it without using any extra memory.
MSD radix sort works, of course; it just requires more book-keeping and/or non-tail recursion. It's only a "problem" in that the CLRS authors (in common with other expert programmers) don't like to do fiddly work until it's necessary.

Resources