Distribute numbers to two "containers" and minimize their difference of sum - algorithm

Suppose there are n numbers let says we have the following 4 numbers 15,20,10,25
There are two container A and B and my job is to distribute numbers to them so that the sum of the number in each container have the least difference.
In the above example, A should have 15+20 and B should have 10+ 25. So difference = 0.
I think of a method. It seems to work but I don't know why.
Sort the number list in descending order first. In each round, take the maximum number out
and put to the container have less sum.
Btw, is it can be solved by DP?
THX

In fact, your method doesn't always work. Think about that 2,4,4,5,5.The result by your method will be (5,4,2)(5,4), while the best answer is (5,5)(4,4,2).
Yes, it can be solved by Dynamical Programming.Here are some useful link:
Tutorial and Code: http://www.cs.cornell.edu/~wdtseng/icpc/notes/dp3.pdf
A practice: http://people.csail.mit.edu/bdean/6.046/dp/ (then click Balanced Partition)
What's more, please note that if the scale of problem is damn large (like you have 5 million numbers etc.), you won't want to use DP which needs a too huge matrix. If this is the case, you want to use a kind of Monte Carlo Algorithm:
divide n numbers into two groups randomly (or use your method at this step if you like);
choose one number from each group, if (to swap these two number decrease the difference of sum) swap them;
repeat step 2 until "no swap occurred for a long time".
You don't want to expect this method could always work out with the best answer, but it is the only way I know to solve this problem at very large scale within reasonable time and memory.

Related

Most efficient algorithm for this problem

Problem: Let's say I have an array of size N, in range [1, N].
I want to generate a random number k such that 1 <= k <= N, and occupy
arr[k] with a random letter. I will get another random number p, that is different from k,
randomly and occupy that slot. I'll repeat this over and over until there are no slots left. Also, there might be a chance where one of these randomly generated letters might be removed from a slot they previously occupied.
The problem here is that every time I pick a new number, it'll be harder for me to guess a random number that wasn't occupied yet.
I'm trying to find a solution for this problem, obviously in the most efficient way possible (both space and complexity).
I've come up with several solutions but I don't think they're too efficient and I'm having a hard time calculating their complexity.
Plan A) For each integer I randomly choose in the range [1, N], I check if it's occupied. If it is, I re-roll until I get one that isn't occupied. This becomes inefficient with high orders of N as the collisions are pretty high.
Plan B) For each iteration I go over all values of the array, those who I don't occupy I write down in a list. Afterwards I shuffle the list (via Fisher-Yates shuffle for example?) and get arbitrarily the first object. This isn't space efficient, and for my problem, I can't just keep one list and shuffle from there, since in my program there might be multiple threads calculating this.
Plan C) A slight change of plan A: I randomly choose in the range [1, N], I check if it's occupied. If it is, I +1 and try the next one, if it is, again +1. Worst case all array slots are occupied except one -> O(N).
Is there a better way to approach this problem? This specific problem is a very important module in my software so time complexity is vital. If there isn't, which way do you think I should go with (For those who have talent on calculating time complexity). Thank you!
I would suggest that you use both plans A and B.
Use A until the array is mostly filled. Then search for the unused indices, put them into an array, and use plan B. You will have to experiment with your constraints to figure out when to switch.
Your comment about multiple threads doing this is concerning though. Be aware that when multiple threads access the same memory, race conditions are easy and contention for access slows you down.

Finding the average of large list of numbers

Came across this interview question.
Write an algorithm to find the mean(average) of a large list. This
list could contain trillions or quadrillions of number. Each number is
manageable in hundreds, thousands or millions.
Googling it gave me all Median of Medians solutions. How should I approach this problem?
Is divide and conquer enough to deal with trillions of number?
How to deal with the list of the such a large size?
If the size of the list is computable, it's really just a matter of how much memory you have available, how long it's supposed to take and how simple the algorithm is supposed to be.
Basically, you can just add everything up and divide by the size.
If you don't have enough memory, dividing first might work (Note that you will probably lose some precision that way).
Another approach would be to recursively split the list into 2 halves and calculating the mean of the sublists' means. Your recursion termination condition is a list size of 1, in which case the mean is simply the only element of the list. If you encounter a list of odd size, make either the first or second sublist longer, this is pretty much arbitrary and doesn't even have to be consistent.
If, however, you list is so giant that its size can't be computed, there's no way to split it into 2 sublists. In that case, the recursive approach works pretty much the other way around. Instead of splitting into 2 lists with n/2 elements, you split into n/2 lists with 2 elements (or rather, calculate their mean immediately). So basically, you calculate the mean of elements 1 and 2, that becomes you new element 1. the mean of 3 and 4 is your new second element, and so on. Then apply the same algorithm to the new list until only 1 element remains. If you encounter a list of odd size, either add an element at the end or ignore the last one. If you add one, you should try to get as close as possible to your expected mean.
While this won't calculate the mean mathematically exactly, for lists of that size, it will be sufficiently close. This is pretty much a mean of means approach. You could also go the median of medians route, in which case you select the median of sublists recursively. The same principles apply, but you will generally want to get an odd number.
You could even combine the approaches and calculate the mean if your list is of even size and the median if it's of odd size. Doing this over many recursion steps will generate a pretty accurate result.
First of all, this is an interview question. The problem as stated would not arise in practice. Also, the question as stated here is imprecise. That is probably deliberate. (They want to see how you deal with solving an imprecisely specified problem.)
Write an algorithm to find the mean(average) of a large list.
The word "find" is rubbery. It could mean calculate (to some precision) or it could mean estimate.
The phrase "large list" is rubbery. If could mean a list or array data structure in memory, or the "list" could be the result of a database query, the contents of a file or files.
There is no mention of the hardware constraints on the system where this will be implemented.
So the first thing >>I<< would do would be to try to narrow the scope by asking some questions of the interviewer.
But assuming that you can't, then a complete answer would need to cover the following points:
The dataset probably won't fit in memory at the same time. (But if it does, then that is good.)
Calculating the average of N numbers is O(N) if you do it serially. For N this size, it could be an intractable problem.
An alternative is to split into sublists of equals size and calculate the averages, and the average of the averages. In theory, this gives you O(N/P) where P is the number of partitions. The parallelism could be implemented with multiple threads, with multiple processes on the same machine, or distributed.
In practice, the limiting factors are going to be computational, memory and/or I/O bandwidth. A parallel solution will be effective if you can address these limits. For example, you need to balance the problem of each "worker" having uncontended access to its "sublist" versus the problem of making copies of the data so that that can happen.
If the list is represented in a way that allows sampling, then you can estimate the average without looking at the entire dataset. In fact, this could be O(C) depending on how you sample. But there is a risk that your sample will be unrepresentative, and the average will be too inaccurate.
In all cases doing calculations, you need to guard against (integer) overflow and (floating point) rounding errors. Especially while calculating the sums.
It would be worthwhile discussing how you would solve this with a "big data" platform (e.g. Hadoop) and the limitations of that approach (e.g. time taken to load up the data ...)

Optimized Algorithm: Fastest Way to Derive Sets

I'm writing a program for a competition and I need to be faster than all the other competitors. For this I need a little algorithm help; ideally I'd be using the fastest algorithm.
For this problem I am given 2 things. The first is a list of tuples, each of which contains exactly two elements (strings), each of which represents an item. The second is an integer, which indicates how many unique items there are in total. For example:
# of items = 3
[("ball","chair"),("ball","box"),("box","chair"),("chair","box")]
The same tuples can be repeated/ they are not necessarily unique.) My program is supposed to figure out the maximum number of tuples that can "agree" when the items are sorted into two groups. This means that if all the items are broken into two ideal groups, group 1 and group 2, what are the maximum number of tuples that can have their first item in group 1 and their second item in group 2.
For example, the answer to my earlier example would be 2, with "ball" in group 1 and "chair" and "box" in group 2, satisfying the first two tuples. I do not necessarily need know what items go in which group, I just need to know what the maximum number of satisfied tuples could be.
At the moment I'm trying a recursive approach, but its running on (n^2), far too inefficient in my opinion. Does anyone have a method that could produce a faster algorithm?
Thanks!!!!!!!!!!
Speed up approaches for your task:
1. Use integers
Convert the strings to integers (store the strings in an array and use the position for the tupples.
String[] words = {"ball", "chair", "box"};
In tuppls ball now has number 0 (pos 0 in array) , chair 1, box 2.
comparing ints is faster than Strings.
2. Avoid recursion
Recursion is slow, due the recursion overhead.
For example look at binarys search algorithm in a recursive implementatiion, then look how java implements binSearch() (with a while loop and iteration)
Recursion is helpfull if problems are so complex that a non recursive implementation is to complex for a human brain.
An iterataion is faster, but not in the case when you mimick recursive calls by implementing your own stack.
However you can start implementing using a recursiove algorithm, once it works and it is a suited algo, then try to convert to a non recursive implementation
3. if possible avoid objects
if you want the fastest, the now it becomes ugly!
A tuppel array can either be stored in as array of class Point(x,y) or probably faster,
as array of int:
Example:
(1,2), (2,3), (3,4) can be stored as array: (1,2,2,3,3,4)
This needs much less memory because an object needs at least 12 bytes (in java).
Less memory becomes faster, when the array are really big, then your structure will hopefully fits in the processor cache, while the objects array does not.
4. Programming language
In C it will be faster than in Java.
Maximum cut is a special case of your problem, so I doubt you have a quadratic algorithm for it. (Maximum cut is NP-complete and it corresponds to the case where every tuple (A,B) also appears in reverse as (B,A) the same number of times.)
The best strategy for you to try here is "branch and bound." It's a variant of the straightforward recursive search you've probably already coded up. You keep track of the value of the best solution you've found so far. In each recursive call, you check whether it's even possible to beat the best known solution with the choices you've fixed so far.
One thing that may help (or may hurt) is to "probe": for each as-yet-unfixed item, see if putting that item on one of the two sides leads only to suboptimal solutions; if so, you know that item needs to be on the other side.
Another useful trick is to recurse on items that appear frequently both as the first element and as the second element of your tuples.
You should pay particular attention to the "bound" step --- finding an upper bound on the best possible solution given the choices you've fixed.

An algorithm to sort a list of values into n groups so that the sum of each group is as close as possible

Basically I have a number of values that I need to split into n different groups so that the sums of each group are as close as possible to the sums of the others? The list of values isn't terribly long so I could potentially just brute force it but I was wondering if anyone knows of a more efficient method of doing this. Thanks.
If an approximate solution is enough, then sort the numbers descendingly, loop over them and assign each number to the group with the smallest sum.
groups = [list() for i in range(NUM_GROUPS)]
for x in sorted(numbers, reverse=True):
mingroup = groups[0]
for g in groups:
if sum(g) < sum(mingroup):
mingroup = g
mingroup.append(x)
This problem is called "multiway partition problem" and indeed is computationally hard. Googling for it returned an interesting paper "Multi-Way Number Paritioning where the author mentions the heuristic suggested by larsmans and proposes some more advanced algorithms. If the above heuristic is not enough, you may have a look at the paper or maybe contact the author, he seems to be doing research in that area.
Brute force might not work out as well as you think...
Presume you have 100 variables and 20 groups:
You can put 1 variable in 20 different groups, which makes 20 combinations.
You can put 2 variables in 20 different groups each, which makes 20 * 20 = 20^2 = 400 combinations.
You can put 3 variables in 20 different groups each, which makes 20 * 20 * 20 = 20^3 = 8000 combinations.
...
You can put 100 variables in 20 different groups each, which makes 20^100 combinations, which is more than the minimum number of atoms in the known universe (10^80).
OK, you can do that a bit smarter (it doesn't matter where you put the first variable, ...) to get to something like Branch and Bound, but that will still scale horribly.
So either use a fast deterministic algorithm, like larsman proposes.
Or if you need a more optimal solution and have the time to implement it, take a look at metaheuristic algorithms and software that implement them (such as Drools Planner).
You can sum the numbers and divide by the number of groups. This gives you the target value for the sums. Sort the numbers and then try to get subsets to add up to the required sum. Start with the largest values possible, as they will cause the most variability in the sums. Once you decide on a group that is not the optimal sum (but close), you could recompute the expected sum of the remaining numbers (over n-1 groups) to minimize the RMS deviation from optimal for the remaining groups (if that's a metric you care about). Combining this "expected sum" concept with larsmans answer, you should have enough information to arrive at a fast approximate answer. Nothing optimal about it, but far better than random and with a nicely bounded run time.
Do you know how many groups you need to split it into ahead of time?
Do you have some limit to the maximum size of a group?
A few algorithms for variations of this problem:
Knuth's word-wrap algorithm
algorithms minimizing the number of floppies needed to store a set of files, but keeping any one file immediately readable from the disk it is stored on (rather than piecing it together from fragments stored on 2 disks) -- I hear that "copy to floppy with best fit" was popular.
Calculating a cutting list with the least amount of off cut waste. Calculating a cutting list with the least amount of off cut waste
What is a good algorithm for compacting records in a blocked file? What is a good algorithm for compacting records in a blocked file?
Given N processors, how do I schedule a bunch of subtasks such that the entire job is complete in minimum time? multiprocessor scheduling.

How do you determine the best, worse and average case complexity of a problem generated with random dice rolls?

There is a picture book with 100 pages. If dice are rolled randomly to select one of the pages and subsequently rerolled in order to search for a certain picture in the book -- how do I determine the best, worst and average case complexity of this problem?
Proposed answer:
best case: picture is found on the first dice roll
worst case: picture is found on 100th dice roll or picture does not exist
average case: picture is found after 50 dice rolls (= 100 / 2)
Assumption: incorrect pictures are searched at most one time
Given your description of the problem, I don't think your assumption (that incorrect pictures are only "searched" once) sounds right. If you don't make that assumption, then the answer is as shown below. You'll see the answers are somewhat different from what you proposed.
It is possible that you could be successful on the first attempt. Therefore, the first answer is 1.
If you were unlucky, you could keep rolling the wrong number forever. So the second answer is infinity.
The third question is less obvious.
What is the average number of rolls? You need to be familiar with the Geometric Distribution: the number of trials needed to get a single success.
Define p as the probability for a successful trial; p=0.01.
Let Pr(x = k) be the probability that the first successful trial being the k th. Then we're going to have to have (k-1) failures and one success. So Pr(x=k) = (1-p)^(k-1) * p. Verify that this is the "probability mass function" on the wiki page (left column).
The mean of the geometric distribution is 1/p, which is therefore 100. This is the average number of rolls required to find the specific picture.
(Note: We need to consider 1 as the lowest possible value, rather than 0, so use the left hand column of the table on the Wikipedia page.)
To analyze this, think about what the best, worst and average cases actually are. You need to answer three questions to find those three cases:
What is the fewest number of rolls to find the desired page?
What is the largest number of rolls to find the desired page?
What is the average number of rolls to find the desired page?
Once you find the first two, the third should be less tricky. If you need asymptotic notation as opposed to just the number of rolls, think about how the answers to each question change if you change the number of pages in the book (e.g. 200 pages vs 100 pages vs 50 pages).
The worst case is not the page found after 100 dice rolls. That would be is your dice always returned different numbers. The worst case is that you never find the page (the way you stated the problem).
The average case is not average of the best and worst cases, fortunately.
The average case is:
1 * (probability of finding page on the first dice roll)
+ 2 * (probability of finding page on the second dice roll)
+ ...
And yes, the sum is infinite, since in thinking about the worst case we determined that you may have an arbitrarily large number of dice rolls. It doesn't mean that it can't be computed (it could mean that, but it doesn't have to).
The probability of finding the page on the first try is 1/100. What's the probability of finding it on the second dice roll?
You're almost there, but (1 + 2 + ... + 100)/100 isn't 50.
It might help to observe that your random selection method is equivalent to randomly shuffling the whole deck and then searching it in order for your target. Each position is equally likely, so the average is straightforward to calculate. Except of course that you aren't doing all that work up front, just as much as is needed to generate each random number and access the corresponding element.
Note that if your book were stored as a linked list, then the cost of moving from each randomly-selected page to the next selection depends on how far apart they are, which will complicate the analysis quite a lot. You don't actually state that you have constant-time access, and it's possibly debatable whether a "real book" provides that or not.
For that matter, there's more than one way to choose random numbers without repeats, and not all of them have the same running time.
So, you'd need more detail in order to analyse the algorithm in terms of anything other than "number of pages visited".

Resources