Hungarian assignment alternative with large and unbalanced profit matrix - algorithm

I need help solving the assignment problem in a particular case. In one scenario the profit matrix has dimensions 2000 by 23000 (2000 items and 23000 bins, where each bin can hold at most one item, and there are no negative profits). If the Hungarian algorithm is applied, it will first create a square matrix of 23000 by 23000, which causes an OutOfMemory exception.
The problem I want to solve is just the maximum profit that the optimal assignment scheme can produce. Hence there is no need to output the actual optimal assignment; only the optimal value is needed, and an approximation would do. I wonder whether there is an alternative approach that saves memory and computation cost.
Thanks in advance.

You can actually simulate all of the dummy columns with just one column.
I don't know which specific instructions you're following, but one of the steps in the algorithm is to cover every zero in the matrix using the fewest possible horizontal and vertical lines. At the start, when all of the dummy columns are still zero, the most efficient way to cover them is column-wise. Some rows may be covered as well, and the values covered twice are incremented (by the minimum uncovered value, in the usual formulation). More importantly, the same row in every dummy column is incremented by the same amount.
Continuing on, now that some of the rows of the dummy columns are no longer zero, we reach this step again. Since all of the dummy columns are still identical, it follows that if it is efficient to cover one dummy column, it is efficient to cover them all. So even though the values may change, every dummy column will always be identical to every other dummy column, and you can therefore represent them all with a single array.
You may still run into problems if you have a lot of real data, but this should help in situations like the one you mentioned.
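As a side note on memory: some off-the-shelf Hungarian implementations accept rectangular matrices directly, so no dummy columns need to be materialized at all. A minimal sketch using SciPy (the profit matrix here is random placeholder data):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Placeholder profit matrix: 2000 items x 23000 bins.
    profit = np.random.rand(2000, 23000)

    # Solves the rectangular assignment problem without padding to a square;
    # maximize=True because this is a profit, not a cost, matrix.
    rows, cols = linear_sum_assignment(profit, maximize=True)
    optimal_value = profit[rows, cols].sum()

This sidesteps the 23000 by 23000 allocation entirely; only the 2000 by 23000 matrix is ever stored.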

Related

What kind of PRNG would match such scatter plots?

I've been given the challenge to find the seed from a series of pseudo-randomly generated alphanumerical IDs and after some analysis, I'm stuck in a dead end that I hope you'll be able to get me out of.
Each ID is obtained by passing the previous one through the encryption algorithm that I'm supposed to reverse engineer in order to find the seed. The list given to me consists of the first 2070 IDs (without the seed, obviously). The IDs start as 4 alphanumeric characters and switch to 5 after some time (e.g. "2xm1", "34nj", "avdfe", "2lgq9").
This switch happens once the algorithm, after encrypting an ID, returns an ID that has already been generated before. At that point, it adds one character to the returned ID, making it longer and thus unique, and then proceeds as usual, generating IDs of the new length. This effectively means that the generation algorithm is not injective.
My first reflex was to convert those IDs from base 36 to other bases, notably decimal. I used the results to scatter-plot the IDs' decimal values against their rank in the list, and I noticed a pattern whose origin I couldn't explain.
After splitting the list in two by ID length, I scatter-plotted the same graph for the 4-character and 5-character sub-lists, which let me notice the strange density patterns.
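A minimal sketch of that conversion and plotting step (the IDs here are just the four examples quoted above, not the real list):

    import matplotlib.pyplot as plt

    ids = ["2xm1", "34nj", "avdfe", "2lgq9"]   # placeholder sample

    values = [int(s, 36) for s in ids]         # base 36 -> decimal
    plt.scatter(range(len(values)), values)    # decimal value vs. rank
    plt.xlabel("rank in the list")
    plt.ylabel("decimal value of the ID")
    plt.show()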
After some analysis, I've observed two things:
For each sub-list, the delimitation between the two densities is 6 x 36^(n-1), n being the number of characters in the ID. In other words, it is 1/6th of the entire range of values for a given ID length, the range being [0; 36^n - 1].
The distribution of the IDs relative to this limit tends towards 50/50: half of them are above the 1/6th limit, and half are below it.
I've tried to correlate this behavior with the scatter plots of other known PRNGs, but none of them matched what I get in my graphs.
I'm hoping some of you might know about an encryption method, formula, or function matching such a specific scatter plot, or have any idea about what could be going on behind the scenes.
Thanks in advance for your answers.
This answer may not be very useful, but I think it can help. The plot you've shown most likely doesn't match any of the commonly used PRNGs, and it certainly wouldn't match a cryptographic PRNG.
But I have an observation; I don't know if it helps. This PRNG seems to have a full period equal to the full cycle of numbers that can be generated for a fixed number of characters. I mean, it follows a pattern for 4 characters, then repeats that pattern at a larger magnitude for 5 characters, which probably means the same distribution pattern will repeat again for 6 characters at a larger magnitude still.
So, in summary, this pattern could be exploited if you knew the value of that magnitude: you would then know the increments for the 6-character plot, and you could simply stretch the 5-character graph along the Y-axis to get some kind of solution (which would be the seed for the 6-character graph).
EDIT: To clarify things regarding your comment: what I mean is that this PRNG generates random numbers, but those numbers do not go unrepeated forever; at some point the same sequence starts to be regenerated. Your description confirms this: when the algorithm encounters a number it has generated before (i.e., reaches the point where the same sequence would repeat), it just adds one extra character to the ID, which does not change the distribution on the graph but makes it appear stretched along the Y-axis (as if the Y-intercept of the graph function just got bigger).

Generating minimal/irreducible Sudokus

A Sudoku puzzle is minimal (also called irreducible) if it has a unique solution, but removing any digit would yield a puzzle with multiple solutions. In other words, every digit is necessary to determine the solution.
I have a basic algorithm to generate minimal Sudokus:
Generate a completed puzzle.
Visit each cell in a random order. For each visited cell:
Tentatively remove its digit
Solve the puzzle twice using a recursive backtracking algorithm. One solver tries the digits 1-9 in forward order, the other in reverse order. In a sense, the solvers are traversing a search tree containing all possible configurations, but from opposite ends. This means that the two solutions will match iff the puzzle has a unique solution.
If the puzzle has a unique solution, remove the digit permanently; otherwise, put it back in.
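A minimal sketch of that loop (solve_forward and solve_reverse stand in for the two opposite-order backtracking solvers; they are assumed to return the solved grid without mutating their input, and the puzzle is an 81-element list with 0 marking empty cells):

    import random

    def minimize(puzzle, solve_forward, solve_reverse):
        cells = [i for i in range(81) if puzzle[i] != 0]
        random.shuffle(cells)                    # visit cells in random order
        for cell in cells:
            digit = puzzle[cell]
            puzzle[cell] = 0                     # tentatively remove the digit
            if solve_forward(puzzle) != solve_reverse(puzzle):
                puzzle[cell] = digit             # multiple solutions: restore
        return puzzle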
This method is guaranteed to produce a minimal puzzle, but it's quite slow (100 ms on my computer, several seconds on a smartphone). I would like to reduce the number of solves, but all the obvious ways I can think of are incorrect. For example:
Adding digits instead of removing them. The advantage is that since minimal puzzles require at least 17 filled digits, the first 16 digits placed are guaranteed not to yield a unique solution, reducing the amount of solving. Unfortunately, because the cells are visited in a random order, many unnecessary digits will be added before the one important digit that "locks down" a unique solution. For instance, if the first 9 cells added are all in the same column, there's a great deal of redundant information there.
If no other digit can replace the current one, keep it in and do not solve the puzzle. Because checking if a placement is legal is thousands of times faster than solving the puzzle twice, this could be a huge time-saver. However, just because there's no other legal digit now doesn't mean there won't be later, once we remove other digits.
Since we generated the original solution, solve only once for each cell and see if it matches the original. This doesn't work because the original solution could be anywhere within the search tree of possible solutions. For example, if the original solution is near the "left" side of the tree, and we start searching from the left, we will miss solutions on the right side of the tree.
I would also like to optimize the solving algorithm itself. The hard part is determining if a solution is unique. I can make micro-optimizations like creating a bitmask of legal placements for each cell, as described in this wonderful post. However, more advanced algorithms like Dancing Links or simulated annealing are not designed to determine uniqueness, but just to find any solution.
How can I optimize my minimal Sudoku generator?
I have an idea: the 2nd option you suggested could work better, provided you add three extra checks for the first 17 numbers:
Generate a list of 17 random numbers between 1 and 9.
Add each one at a random location, provided the newly added number does not violate the three basic Sudoku constraints:
no duplicate of the number in the same row,
no duplicate of the number in the same column,
no duplicate of the number in the same 3x3 box.
If a placement fails, move to the next column or row and check the three constraints again.
If there is no next position (i.e., at the 9th column or 9th row), wrap around to the 1st column.
Once the 17 digits are placed, run your solver logic on the grid and look for a unique solution.
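A minimal sketch of the three-constraint placement check (assuming the grid is a 9x9 list of lists with 0 for empty cells):

    def legal(grid, row, col, digit):
        if any(grid[row][c] == digit for c in range(9)):   # same row
            return False
        if any(grid[r][col] == digit for r in range(9)):   # same column
            return False
        br, bc = 3 * (row // 3), 3 * (col // 3)            # 3x3 box origin
        return all(grid[br + r][bc + c] != digit
                   for r in range(3) for c in range(3))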
Here are the main optimizations I implemented with (highly approximate) percentage increases in speed:
Using bitmasks to keep track of which constraints (row, column, box) are satisfied in each cell. This makes it much faster to look up whether a placement is legal, but slower to make a placement. A complicating factor when generating puzzles with bitmasks, rather than just solving them, is that digits may have to be removed, which means you need to track the three types of constraints as distinct bits (see the sketch after this list). A small further optimization is to save the masks for each digit and each constraint in arrays. 40%
Timing out the generation and restarting if it takes too long. See here. The optimal strategy is to increase the timeout period after each failed generation, to reduce the chance that it goes on indefinitely. 30%, mainly from reducing the worst-case runtimes.
mbeckish and user295691's suggestions (see the comments to the original post). 25%
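A minimal sketch of the bitmask bookkeeping from the first item (the names are illustrative, not the actual generator code):

    # Bit d of rows[r] is set iff digit d is already placed in row r; cols
    # and boxes work the same way. Keeping three distinct masks is what
    # allows a digit to be removed again during generation.
    rows, cols, boxes = [0] * 9, [0] * 9, [0] * 9

    def box(r, c):
        return (r // 3) * 3 + c // 3

    def can_place(r, c, d):
        return not ((rows[r] | cols[c] | boxes[box(r, c)]) & (1 << d))

    def place(r, c, d):
        m = 1 << d
        rows[r] |= m; cols[c] |= m; boxes[box(r, c)] |= m

    def unplace(r, c, d):
        m = ~(1 << d)
        rows[r] &= m; cols[c] &= m; boxes[box(r, c)] &= m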

sorting algorithm to keep sort position numbers updated

Every once in a while I must deal with a list of elements that the user can sort manually.
In most cases I try to rely on a model using an order-sensitive container; however, this is not always possible, and I resort to adding a position field to my data. This position field is of type double, so I can always calculate a position between two existing numbers. This is not ideal, though, because I am concerned about reaching an edge case where I don't have enough numerical precision to keep inserting between two numbers.
I am having doubts about the best approach to maintaining my position numbers. My first thought is to traverse all the rows and give them round numbers after every insertion, like:
Right after dropping a row between 2 and 3:
1 2 2.5 3 4 5
After position numbers update:
1 2 3 4 5 6
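A minimal sketch of both operations (item objects with a float position field are assumed):

    def position_between(prev_pos, next_pos):
        # One-line insert: dropping between 2 and 3 yields 2.5.
        return (prev_pos + next_pos) / 2.0

    def renumber(items):
        # Full update pass: rewrite every position as a round number.
        for i, item in enumerate(items, start=1):
            item.position = float(i)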
That, of course, might get heavy if I have a high number of entries: not especially in memory, but in storing all the new values back to disk/database. I usually work with some kind of ORM and mobile software. Updating all the values will pull every object off the disk and mark it dirty, leading to a re-verification of all the related validation rules of my data model.
I could also wait until the precision is no longer enough to calculate a number between two positions. However, the user experience would suffer, since the same operation would no longer take the same amount of time.
I believe there is a standard algorithm for these cases that regularly and consistently keeps the position numbers updated, or just some of them. Ideally it would be O(log n), with no big time difference between the worst and best cases.
To be honest, I also think that anything the user must sort manually cannot grow so large that the worst case becomes a real problem. The edge case also seems extremely rare, even more so if I look for a solution that pushes the boundary numbers outward. However, I still believe there is a standard, well-known solution to this problem that I am not aware of, and I would like to learn about it.
Second try.
Consider the full range of position values, say 0 -> 1000
The first item we insert should have a position of 500. Our list is now :
(0) -> 500 -> (1000).
If you insert another item at first position, we end up with :
(0) -> 250 -> 500 -> (1000).
If we keep inserting items at the first position, we're going to have a problem, as our ranges are not equally balanced and... wait... balanced? Doesn't that sound like a binary tree problem!?
Basically, you store your list as a binary tree. When inserting a node, you assign it a position according to the surrounding nodes. When your tree becomes unbalanced, you rotate nodes to make it balanced again, and you recompute positions for the rotated nodes!
So:
Most of the time, adding a node will not require changing the positions of other nodes.
When balancing is required, only a subset of your items will change.
It's O(log n)!
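A minimal sketch of the interval idea behind this (a hypothetical node type with left, right, and position is assumed; after a rotation, you would re-run this on the rotated subtree only, passing the interval that subtree owns):

    def assign_positions(node, lo=0.0, hi=1000.0):
        # Each node sits at the midpoint of its interval; children recurse
        # into the left and right halves, so positions follow tree order.
        if node is None:
            return
        node.position = (lo + hi) / 2.0
        assign_positions(node.left, lo, node.position)
        assign_positions(node.right, node.position, hi)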
EDIT
If the user is actually sorting the list manually, then is there really any need to worry about taking O(n) to record the new order? It's O(n) in any case just to display the list to the user.
This doesn't really answer the question, but...
Since you talked about "adding a position field to your data", I suppose your data store is a relational database and your data has some kind of identifier.
So maybe you can implement a doubly linked list by adding previous_data_id and next_data_id fields to your data. Insert/move/remove operations are then O(1).
Loading such a collection from a database is rather easy:
Fetch all the items and add them to a map, keyed by id.
For each item, connect it with its previous and next items.
Starting with the first item (whose previous_data_id is undefined), follow the chain and append the items to a list.
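A minimal sketch of that loading procedure (rows is assumed to be the list of fetched records, with id, previous_data_id, and next_data_id fields as described):

    def load_ordered(rows):
        by_id = {row.id: row for row in rows}
        # The first item is the one with no predecessor.
        current = next(r for r in rows if r.previous_data_id is None)
        ordered = []
        while current is not None:
            ordered.append(current)
            current = by_id.get(current.next_data_id)
        return ordered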
After some days with no valid answer, this is my theory:
The real challenge here is a practical solution. Maybe there is a mathematically correct solution, but with every day that goes by, it seems that implementing it would be of great complexity. A good solution should not only be mathematically correct, but also balanced against the nature of the problem, the low odds of ever meeting it, and its minor consequences. Like killing flies with bullets: extremely effective, yet pointless.
I am starting to believe that a good answer could be: to hell with the right solution; keep the one-line calculation and live with the rare case where the sorting of two elements might fail. It is not worth increasing complexity and investing time or money in such a nit-picky problem, so rare, that causes no data damage, just a temporary UX glitch.

3x3 grid puzzle solving (JS)

I have an image separated into a 3x3 grid. The grid is represented by an array. Each column or row can be rotated through; e.g., the top row [1,2,3] could become [3,1,2], etc.
The array needs to end up as:
[1,2,3]
[4,5,6]
[7,8,9]
And would start from something like:
[5,3,9]
[7,1,4]
[8,6,2]
It will always be solvable, so this doesn't need to be checked.
I've tried a long-handed approach of looking for '1' and moving it left and then up into its correct place, and so on for 2, 3, ... but I end up going round in circles forever.
Any help would be appreciated, even if you can just give me a starting point/reference... I can't seem to think through this one.
Your problem is that the moves needed to shift one value will mess up others. I suspect that with enough set theory you could work out an exact solution, but here's a heuristic that has a better chance of working.
First, note that if every number in a row belongs to that row, then the puzzle is either trivial to solve or some values are swapped. [2,3,1] is trivial, while [3,2,1] is swapped, for example.
So an "easier" target than placing 1 in the top left is to get all rows into that state. How might we do that? Let's look at the columns...
If a column contains one number from each row, then we are in a state similar to the above (it's either trivial to shift the numbers into the correct rows, or some are swapped).
So, what I would suggest is:
for column in columns:
    if column does not contain one value from each row:
        pick a value from the column that comes from a duplicated row
        rotate that row
for column in columns:
    shift, as well as possible, until each value is in the correct row
for row in rows:
    shift, as well as possible, until each value is in the correct column
Now, that is not guaranteed to work, although it will tend to get close, and it can solve some set of "almost right" arrangements.
So what I would then do is put that in a loop and, on each run, record a "hash" of the state (for example, a string containing the values read row by row). Then, on each iteration, if I detect that the state has already occurred (by checking whether the hash is one we have seen before), meaning we are repeating ourselves, I would invoke a "random shuffle" that mixes things up (see the sketch at the end of this answer).
So the idea is that we have something with a chance of working once we are close, and a shuffle we resort to when that gets stuck in a loop.
As I said, I am sure there are smarter ways to do this, but if I were desperate and couldn't find anything on Google, this is the kind of heuristic I would try... I am not even sure the above is right, but the more general tactic is:
Identify something that will solve very close arrangements (in a sense, find where the puzzle is "linear").
Try repeating it.
Shuffle if it starts repeating itself.
And that's really all I am saying here.
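A minimal sketch of that hash-and-shuffle loop (Python for brevity; heuristic_pass, random_rotation, and is_solved are hypothetical helpers for the steps described above, and the grid is a list of three row lists):

    def solve(grid):
        seen = set()
        while not is_solved(grid):
            heuristic_pass(grid)                  # the row/column fixing above
            state = ",".join(str(v) for row in grid for v in row)
            if state in seen:                     # we are repeating ourselves
                random_rotation(grid)             # shuffle to escape the loop
            seen.add(state)
        return grid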
Since the grid is 3x3, you can not only find a solution, but also the smallest number of moves that solves the problem.
You would need to use Breadth-First Search (BFS) for this purpose.
Represent each configuration as a linear array of 9 elements. After each move, you reach a different configuration. Since the array is essentially a permutation of the numbers 1 through 9, there are only 9! = 362,880 possible configurations.
If we consider each configuration a node and each move an edge, we can explore the entire graph in O(n), where n is the number of configurations. We need to make sure that we do not re-solve a configuration we have already seen, so you need a 'visited' array that marks each configuration as it is encountered.
When you reach the 'solved' configuration, you can trace back the moves taken by using a 'parent' array, which stores the configuration you came from.
Also note that if it had been a 4x4 grid, the problem would have been quite intractable, since n would equal (4x4)! = 16! ≈ 2.09 × 10^13. But for smaller problems like this, you can find the solution pretty fast.
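A minimal sketch of that search (Python for brevity; states are 9-tuples read row by row, and the twelve moves are the rotations of each row and column in both directions):

    from collections import deque

    GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 9)

    def neighbors(state):
        grid = [list(state[3 * i:3 * i + 3]) for i in range(3)]
        for i in range(3):
            for d in (1, -1):
                g = [row[:] for row in grid]
                g[i] = g[i][-d:] + g[i][:-d]       # rotate row i
                yield tuple(v for row in g for v in row)
                g = [row[:] for row in grid]
                col = [g[r][i] for r in range(3)]
                col = col[-d:] + col[:-d]          # rotate column i
                for r in range(3):
                    g[r][i] = col[r]
                yield tuple(v for row in g for v in row)

    def shortest_solution(start):
        parent = {tuple(start): None}
        queue = deque([tuple(start)])
        while queue:
            state = queue.popleft()
            if state == GOAL:
                path = []                          # trace back via parents
                while state is not None:
                    path.append(state)
                    state = parent[state]
                return path[::-1]
            for nxt in neighbors(state):
                if nxt not in parent:
                    parent[nxt] = state
                    queue.append(nxt)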
Edit:
TL;DR:
It is guaranteed to work, and pretty fast at that: 362,880 is a small number for today's computers.
It will find the shortest sequence of moves.

Genetic Algorithm Implementation for weight optimization

I am a data mining student, and I have a problem that I was hoping you could give me some advice on:
I need a genetic algorithm that optimizes the weights between three inputs. The weights need to be positive values AND they need to sum to 100%.
The difficulty is in creating an encoding that satisfies the sum to 100% requirement.
As a first pass, I thought I could simply create a chromosome with a series of numbers (e.g. 4, 7, 9). Each weight would simply be its number divided by the sum of all of the chromosome's numbers (e.g. 4/20 = 20%).
The problem with this encoding method is that any change to the chromosome will change the sum of all the chromosome's numbers resulting in a change to all of the chromosome's weights. This would seem to significantly limit the GA's ability to evolve a solution.
Could you give any advice on how to approach this problem?
I have read about real-valued encoding, and I do have an implementation of a GA, but it will give me weights that may not necessarily add up to 100%.
It is mathematically impossible to change one value without changing at least one more if you need the sum to remain constant.
One way to make changes would be exactly what you suggest: weight = value/sum. In this case when you change one value, the difference to be made up is distributed across all the other values.
The other extreme is to change only pairs. Start with a set of values that adds to 100, and whenever one value changes, change another by the opposite amount to maintain the sum. The other value could be picked randomly, or by a rule. I'd expect this to take longer to converge than the first method.
If your chromosome is only 3 values long, then mathematically, these are your only two options.
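A minimal sketch of both schemes (the function names and the mutation scale are illustrative):

    import random

    def weights(chrom):
        # Scheme 1: genes are free-form positives; the weights are derived
        # by dividing each gene by the sum, so they always total 100%.
        total = sum(chrom)
        return [v / total for v in chrom]

    def mutate_pair(chrom, scale=5.0):
        # Scheme 2: values already sum to 100; change two genes by opposite
        # amounts, clamped so both stay non-negative.
        i, j = random.sample(range(len(chrom)), 2)
        delta = random.uniform(-scale, scale)
        delta = max(-chrom[i], min(delta, chrom[j]))
        chrom[i] += delta
        chrom[j] -= delta
        return chrom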
