flatten and compact an array more efficiently - ruby

On many occasions, we need to perform two or more different operations on an array like flatten and compact.
some_array.flatten.compact
My concern here is that it will loop over the array two times. Is there a more efficient way of doing this?

I actually think this is a great question. But first off, why isn't everyone more concerned about this? Here's the performance of flatten and flatten.compact compared:
Here's the code I used to generate this chart, and one that includes memory.
Hopefully now you see why most folks won't worry: composing flatten with compact only adds another constant factor. Still, it's at least theoretically interesting to ask how we could shave off the time and space of that intermediate structure. Asymptotically it's not a big win, but it's curious to think about.
As far as I can tell, you can't do this by making use of flatten:
Before looking at the source, I hoped that flatten could take a block like so:
[[3, [3, 3, 3]], [3, [3, 3, 3]], [3, [3, 3, 3]], nil].flatten {|e| e unless e.nil? }
No dice though. We get this as a return:
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, nil]
This is weird in that it basically tosses the block away as a no-op, but it makes sense once you look at the source: the C function behind Ruby's flatten isn't parameterized to take a block.
The procedure in the Ruby source reads a little oddly to me (I am not a C programmer), but it's basically doing something like a depth-first search: it keeps a stack onto which it pushes every nested array it encounters, and it terminates when none remain to process. I haven't worked this out formally, but it leads me to guess the complexity is on par with DFS.
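To make that concrete, here is a rough sketch (in Python, with None standing in for Ruby's nil, and not the actual C code) of the same stack-based traversal, fused with the compact step so nils never reach an intermediate array:

def flatten_compact(array):
    """Stack-based sketch of flatten that also drops None (Ruby's nil) in one pass."""
    result, stack = [], [iter(array)]
    while stack:                        # keep going until no partially-walked arrays remain
        try:
            item = next(stack[-1])
        except StopIteration:
            stack.pop()                 # finished this nesting level
            continue
        if isinstance(item, list):
            stack.append(iter(item))    # descend into the nested array
        elif item is not None:
            result.append(item)         # skip None, keep everything else
    return result

print(flatten_compact([[3, [3, 3, 3]], [3, [3, 3, 3]], [3, [3, 3, 3]], None]))
# => [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]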
So the source code could have been written such that this would work, by allowing for extra setup when a block is passed in. But without that, you're stuck with the (small) performance hit!

It is not iterating over the same array two times. flatten in general creates an array with an entirely different structure from the original one, so the first and the second iterations are not visiting the same elements. So it naturally follows that you cannot simply fuse the two passes.

If the array is one layer deep, then the sub-arrays can be merged into a set.
require 'set'
s = Set.new
some_array.each { |a| s.merge(a) if a }  # skip nil entries; note that a Set also drops duplicate values

Related

Question regarding mergesort's merge algorithm

Let's suppose we have two sorted arrays, A and B, each consisting of n elements. I don't understand why the time needed to merge these two is "n+n". In order to merge them we need 2n-1 comparisons. For example, in the two following arrays
A = [3, 5, 7, 9] and B = [2, 4, 6, 8]
We will start merging them into a single one, by comparing the elements in the known way. However, when we finally compare 8 with 9, this will be our 2n-1 = 8-1 = 7th comparison, and 8 will be inserted into the new array.
After this, the 9 will be inserted without another comparison. So I guess my question is: since there are 2n-1 comparisons, why do we say that this merging takes 2n time? I'm not saying O(n), I'm saying T(n) = 2n, an exact time function.
It's probably a detail that I'm missing here, so I would be very grateful if someone could provide some insight. Thanks in advance.
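To check the count on the example above, here is a small sketch (mine, not from any textbook) that merges two sorted arrays while tallying comparisons; note that all 2n elements are still copied into the output once each, regardless of how many comparisons were needed:

def merge_count(a, b):
    """Merge two sorted lists and count element comparisons."""
    merged, comparisons = [], 0
    i = j = 0
    while i < len(a) and j < len(b):
        comparisons += 1                  # one comparison per loop iteration
        if a[i] <= b[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(b[j]); j += 1
    merged.extend(a[i:])                  # leftovers are copied without comparing
    merged.extend(b[j:])
    return merged, comparisons

print(merge_count([3, 5, 7, 9], [2, 4, 6, 8]))
# => ([2, 3, 4, 5, 6, 7, 8, 9], 7)  -- 2n-1 = 7 comparisons, 2n = 8 elements moved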

Algorithms for Optimization of Integer Subset Linking

Consider having two sets of integer values that are divided into multiple subsets. The two sets consist of the same values, but the order and the division into subsets differ. The idea is to link the subsets from the first set with those from the second set in such a way that every individual value in each subset of the first set is linked to the same individual value in a subset of the second set. No value can be linked with two others. In one linking step, multiple values can be linked, but only between one subset of the first set and one subset of the second set. The goal is to reduce the number of linking steps as much as possible.
The question is: are there algorithms around for doing this kind of linking as optimal as possible?
I have done some research in several fields of mathematical optimization, such as Linear Programming, Integer Programming, Combinatorial Optimization and Operations Research, but none of the algorithms seem to cover this problem. Do you have any ideas, fields or algorithms to optimize these kinds of problems and point me in the right direction?
For example:
Two sets of integers with two subsets:
[[1, 2, 2] [2, 3, 3]]
and
[[1, 2, 3] [2, 2, 3]].
Now the first linking step could be to link the first subset of the first set, 1[1], with the first subset of the second set, 2[1].
This is one step and leads to a link between 1 - 1 - 1 and 2 - 1 - 1 and a link between 1 - 1 - 2 and 2 - 1 - 2. Now the sets will look like this (values that are already linked are marked with *):
[[1*, 2*, 2] [2, 3, 3]]
and
[[1*, 2*, 3] [2, 2, 3]].
The next step could be linking 1[1] with 2[2], leading to a link between 1 - 1 - 3 and 2 - 2 - 1, after which the sets will look like this:
[[1*, 2*, 2*] [2, 3, 3]]
and
[[1*, 2*, 3] [2*, 2, 3]].
The third step could be linking 1[2] with 2[1], linking one of the 3s. Resulting in:
[[1*, 2*, 2*] [2, 3*, 3]]
and
[[1*, 2*, 3*] [2*, 2, 3]].
And the fourth step could then be linking 1[2] to 2[2], linking the remaining 2 and 3. Resulting in:
[[1*, 2*, 2*] [2*, 3*, 3*]]
and
[[1*, 2*, 3*] [2*, 2*, 3*]], which means every value is linked. This solution costs four steps.
When having larger sets, all subsets can be linked to all other subsets of the other set, but that will result in many steps. Is there an algorithm around that optimizes the number of steps?
Even though this is not an answer, I think it is a step in defining the problem toward finding a solution.
Consider the following example: it is less costly (uses fewer sub-sets) to use the 3rd sub-set of the first list than to use the 2nd and 5th sub-sets.
The algorithm:
1. Define the smaller list: list #2.
2. Create a counting list of all items in all sub-lists of list #2.
3. You will have this counting list {[item:count]}: {[1:3], [2:2], [3:1], [4:2], [5:1]}.
4. Your problem is now no longer to link the sub-sets (which is index-dependent); it is to find the minimum number of sub-sets of list #1 whose items give the counts in the counting list.
5. A simple try of each possible combination would definitely get the answer, but I think from point #4 we can think of a better solution with some conditions to minimize the combination tries (a rough sketch of the brute force follows below).
Hopefully, this suggestion would help in giving a hint towards finding a solution.
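As a rough illustration of the brute force in point 5 (the function name and the single-target framing are my own assumptions, not the asker's or the answerer's), here is how trying combinations of increasing size might look, covering one target sub-set with sub-sets from the other list:

from itertools import combinations
from collections import Counter

def fewest_covering_subsets(source_subsets, target_subset):
    """Try combinations of source sub-sets of increasing size until their
    combined item counts cover every item of target_subset."""
    target = Counter(target_subset)
    for size in range(1, len(source_subsets) + 1):
        for combo in combinations(range(len(source_subsets)), size):
            combined = Counter()
            for i in combo:
                combined.update(source_subsets[i])
            if all(combined[item] >= n for item, n in target.items()):
                return combo                      # indices of the chosen sub-sets
    return None

# With the original question's example: covering the second sub-set of the second
# list, [2, 2, 3], needs both sub-sets of the first list, [1, 2, 2] and [2, 3, 3].
print(fewest_covering_subsets([[1, 2, 2], [2, 3, 3]], [2, 2, 3]))   # => (0, 1)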

Algorithm for seeing if many different arrays are subsets of another one?

Let's say I have an array of ~20-100 integers, for example [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (actually numbers more like [106511349 , 173316561, ...], all nonnegative 64-bit integers under 2^63, but for demonstration purposes let's use these).
And many (~50,000) smaller arrays of usually 1-20 terms to match or not match:
1=[2, 3, 8, 20]
2=[2, 3, NOT 8]
3=[2, 8, NOT 16]
4=[2, 8, NOT 16] (there will be duplicates with different list IDs)
I need to find which of these are subsets of the array being tested. A matching list must have all of the positive matches, and none of the negative ones. So for this small example, I would need to get back something like [3, 4]. List 1 fails to match because it requires 20, and list 2 fails to match because it has NOT 8. The NOT can easily be represented by using the high bit/making the number negative in those cases.
I need to do this quickly, up to 10,000 times per second. The small arrays are "fixed" (they change infrequently, like once every few seconds), while a different large array arrives with each data item to be scanned (so 10,000 different large arrays per second).
This has become a bit of a bottleneck, so I'm looking into ways to optimize it.
I'm not sure the best data structures or ways to represent this. One solution would be to turn it around and see what small lists we even need to consider:
2=[1, 2, 3, 4]
3=[1, 2]
8=[1, 2, 3, 4]
16=[3, 4]
20=[1]
Then we'd build up a list of lists to check, and do the full subset matching on these. However, certain terms (often the more frequent ones) are going to end up in many of the lists, so there's not much of an actual win here.
I was wondering if anyone is aware of a better algorithm for solving this sort of problem?
You could try to make a tree with the smaller arrays, since they change less frequently, such that each subtree tries to halve the number of small arrays left.
For example, do frequency analysis on numbers in the smaller arrays. Find which number is found in closest to half of the smaller arrays. Make that the first check in the tree. In your example that would be '3' since it occurs in half the small arrays. Now that's the head node in the tree. Now put all the small lists that contain 3 to the left subtree and all the other lists to the right subtree. Now repeat this process recursively on each subtree. Then when a large array comes in, reverse index it, and then traverse the subtree to get the lists.
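A rough sketch of that idea (the names and the stopping rule are my own choices): filters are (id, terms) pairs with negative values meaning NOT, the tree only splits and prunes on positive terms, and the surviving candidates still need a full positive/negative check against the large array.

from collections import Counter

def build_tree(filters, min_bucket=4):
    """Split the small arrays on the positive term that occurs in closest to
    half of them; recurse on the 'contains it' / 'doesn't contain it' halves."""
    if len(filters) <= min_bucket:
        return ('leaf', filters)
    counts = Counter(t for _, terms in filters for t in terms if t >= 0)
    if not counts:
        return ('leaf', filters)
    half = len(filters) / 2
    pivot = min(counts, key=lambda t: abs(counts[t] - half))
    left = [f for f in filters if pivot in f[1]]       # filters that require the pivot
    right = [f for f in filters if pivot not in f[1]]
    if not left or not right:
        return ('leaf', filters)
    return ('node', pivot, build_tree(left, min_bucket), build_tree(right, min_bucket))

def candidate_filters(tree, present):
    """Prune whole subtrees: if the pivot is absent from the large array,
    no filter that requires it can possibly match."""
    if tree[0] == 'leaf':
        return tree[1]
    _, pivot, left, right = tree
    found = candidate_filters(right, present)          # filters not requiring the pivot
    if pivot in present:
        found = found + candidate_filters(left, present)
    return found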
You did not state which of your arrays are sorted - if any.
Since your data is not that big, I would use a hash-map to store the entries of the source set (the one with ~20-100 integers). That would basically let you test if an integer is present in O(1).
Then, given that 50,000(arrays) * 20(terms each) * 8(bytes per term) = 8 megabytes + (hash map overhead), does not seem large either for most systems, I would use another hash-map to store tested arrays. This way you don't have to re-test duplicates.
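A rough sketch of that membership-test approach (the helper name and data shapes are mine, not the poster's): put the large array in a set once, then each filter check is a linear walk over its own terms.

def matching_filters(large_array, filters):
    """filters: dict of id -> list of terms, where a negative value means NOT abs(value)."""
    present = set(large_array)                      # O(1) membership tests
    matches = []
    for fid, terms in filters.items():
        ok = all((t in present) if t >= 0 else (-t not in present) for t in terms)
        if ok:
            matches.append(fid)
    return matches

filters = {1: [2, 3, 8, 20], 2: [2, 3, -8], 3: [2, 8, -16], 4: [2, 8, -16]}
print(matching_filters([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], filters))   # => [3, 4]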
I realize this may be less satisfying from a CS point of view, but if you're doing a huge number of tiny tasks that don't affect each other, you might want to consider parallelizing them (multithreading). 10,000 tasks per second, comparing a different array in each task, should fit the bill; you don't give any details about what else you're doing (e.g., where all these arrays are coming from), but it's conceivable that multithreading could improve your throughput by a large factor.
First, do what you were suggesting; make a hashmap from input integer to the IDs of the filter arrays it exists in. That lets you say "input #27 is in these 400 filters", and toss those 400 into a sorted set. You've then gotta do an intersection of the sorted sets for each one.
Optional: make a second hashmap from each input integer to its frequency in the set of filters. When an input comes in, sort it using the second hashmap. Then take the least common input integer and start with it, so you have less overall work to do on each step. Also compute the frequencies for the "not" cases, so you basically get the most bang for your buck on each step.
Finally: this could be pretty easily made into a parallel programming problem; if it's not fast enough on one machine, it seems you could put more machines on it pretty easily, if whatever it's returning is useful enough.

Stack and Queues: Which is simpler to implement using arrays?

I just got this question from a textbook exercise saying
"Stack and Queue ADT's can be implemented using array. Which one is simpler to implement using an array? Explain"
I think using an array is probably not the best way to implement either a stack or a queue in the first place, because of an array's fixed size, unless it is resized whenever it overflows.
I do not have a perfect response to this but which one of them is simpler to implement using arrays?
The only difference that I can think of is that with a stack, you only have to keep track of the front of the stack in the array, while with a queue you need to keep track of both the front and the end of the queue.
"Keep track of" means "storing an array index/offset for".
Other than that, the standard operations on stacks and queues are fairly similar in number; push(), pop() for stacks, and enqueue(), dequeue() for queues, and neither data type is particularly complex or difficult to implement.
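To make the index bookkeeping concrete, here is a minimal sketch (my own, not from the textbook) of fixed-capacity, array-backed versions of both; overflow and underflow checks are omitted, and the queue wraps around so nothing ever has to shift:

class ArrayStack:
    def __init__(self, capacity):
        self.data = [None] * capacity
        self.top = 0                        # the single index: next free slot

    def push(self, x):
        self.data[self.top] = x
        self.top += 1

    def pop(self):
        self.top -= 1
        return self.data[self.top]


class ArrayQueue:
    def __init__(self, capacity):
        self.data = [None] * capacity
        self.front = 0                      # index of the oldest element
        self.size = 0                       # front + size locates the end

    def enqueue(self, x):
        end = (self.front + self.size) % len(self.data)   # wrap around: circular buffer
        self.data[end] = x
        self.size += 1

    def dequeue(self):
        x = self.data[self.front]
        self.front = (self.front + 1) % len(self.data)
        self.size -= 1
        return x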
A stack would be better implemented as an array compared to a queue, mainly because of how the types of operations affect the array itself.
Queue
For a queue data structure, you need to be able to remove elements from one end and push elements into the other. When you have an array, adding or removing an element from the front of the array is relatively bad because it involves you having to shift every other element to accommodate the new one.
queue: [2, 3, 4, 5, 6]
enqueue: 1
queue: [1, 2, 3, 4, 5, 6] (every element had to shift to fit 1 in the front)
or if you oriented your queue the opposite way,
queue: [1, 2, 3, 4, 5, 6]
dequeue: 1
queue: [2, 3, 4, 5, 6] (every element had to shift when 1 was removed from the front)
So no matter which direction you orient your queue, some operation (enqueue or dequeue) will involve adding or removing an element at the front of the array, which in turn forces every other element to shift. That is relatively inefficient, and it is why most queues aren't implemented with a plain array.
Stack
With a stack data structure, you only need to add and remove elements from the same end. This allows us to avoid the problem we were having with adding/removing elements from the front of the array. We just need to orient our stack to add and remove elements from the back of the array, and we will not encounter the problem with having to shift all the elements when something is added or removed.
stack: [1, 2, 3, 4]
push: 5
stack: [1, 2, 3, 4, 5] (nothing had to be shifted)
pop:
stack: [1, 2, 3, 4] (nothing had to be shifted)
Yes, it is obvious that an array is not the best structure for implementing a queue or a stack for real-life problems.
I think the implementation of a stack is always easier than the implementation of a queue, because with a stack we just push the element at the highest index and pop the same element from that index; every operation happens at the same end.
But in the case of a queue, there are two indices to track: one from which we dequeue elements and another at which we enqueue them.
We have to update the index that corresponds to each operation (the front on dequeue and the end on enqueue).

decoding via number combinations algorithm in python 3

OK, so here is the problem.
Let's say:
1 means Bob
2 means Jerry
3 means Tom
4 means Henry
any sum of two of the aforementioned numbers is a status/mood type, which is how the program will be encoded:
7 (4+3) means Angry
5 (3+2) means Sad
3 (2+1) means Mad
4 (3+1) means Happy
and so on...
How may I create a decode function such that it accepts one of the added (encoded) values, such as 7, 5, 3, 4, etc., figures out the combination, and returns the names of the people representing the two numbers that constitute it? Note that a number cannot be repeated to get a mood result, meaning 4 has to be 3+1 and may not be 2+2, so we can assume for this example that there is only one possible combination for each status/mood code. Now the problem is, how do you implement such code in Python 3? What would be the algorithm or logic for such a problem? How do you seek or check for a combination of two numbers? I'm thinking I should just run a loop that keeps adding two numbers at a time until the result matches the status/mood code. Will that work? But this method will soon become obsolete if the number of combinations is increased (as in adding 4 numbers together instead of 2); doing it this way will take up a lot of time and will possibly be inefficient.
I apologize, I know this question is extremely confusing, but please bear with me.
Let's try and work something out.
Use Binary
If you want to have sums that are unique, then assign each possible "Person" a number that's a power of 2. The sum of any combination of these numbers will uniquely identify which numbers were used in the sum.
1, 2, 4, 8, 16, ...
Rather than offer a detailed proof of correctness, I offer an intuitive argument about this: any number can be represented in base 2, and it is always a sum of exactly one combination of powers of 2.
This solution may not be optimal. It has realistic limitations (32 or 64 different "person" identifiers, unless you use some sort of BigInt), but depending on your needs, it might work. Since it uses the smallest possible values, binary is better than any other radix, though.
Example
(Edited)
Here's a quick snippet that demonstrates how you could decode the sum. The returned values are the exponents of the powers of 2. count_persons could be arbitrarily large, as could the range of n iterated over (just as a quick example).
#!/usr/bin/python3
count_persons = 64
for n in range(20, 30):
    # bit i set in n means person i (whose assigned value is 2**i) is part of the sum
    matches = list(filter(lambda i: (n >> i) & 0x1, range(count_persons)))
    print('{0}: {1}'.format(n, matches))
Output:
20: [2, 4]
21: [0, 2, 4]
22: [1, 2, 4]
23: [0, 1, 2, 4]
24: [3, 4]
25: [0, 3, 4]
26: [1, 3, 4]
27: [0, 1, 3, 4]
28: [2, 3, 4]
29: [0, 2, 3, 4]
See a more appropriate answer here
In my opinion, the selected answer is so suboptimal that it can be considered plain wrong.
The table you are building can be indexed with N(N-1)/2 values, while the binary approach needs 2^N.
With a 64-bit unsigned integer, you could encode about sqrt(2^65) values, that is, roughly 6 billion names, compared with the 64 names the binary approach will allow.
Using a big-number library could push the limit somewhat, but the computations involved would be hugely more costly than the simple O(N) reverse-indexing algorithm needed by the alternative approach.
My conclusion is: the binary approach is grossly inefficient, unless you want to play with a handful of values, in which case hard-coding or precomputing the indexes would be just as good a solution.
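One possible realization of such a pair table (my own sketch, not the linked answer): index each unordered pair of person numbers with a triangular index, and reverse it with a short scan over the possible larger member.

def encode_pair(i, j):
    """Index an unordered pair of person numbers 0 <= i < j into a triangular table."""
    return j * (j - 1) // 2 + i

def decode_pair(code):
    """Reverse the indexing with a short O(N) scan for the larger member."""
    j = 1
    while (j + 1) * j // 2 <= code:
        j += 1
    return code - j * (j - 1) // 2, j

# Hypothetical mapping for the question's four people (0-based instead of 1-based).
names = ['Bob', 'Jerry', 'Tom', 'Henry']
code = encode_pair(1, 2)                              # Jerry + Tom
print(code, [names[k] for k in decode_pair(code)])    # => 2 ['Jerry', 'Tom']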
Since the question is very unlikely to match a search on the subject, it is not that important anyway.
