Finding minimum subset of objects with attributes - algorithm

I have an algorithmic problem that I don't know how to solve. Maybe someone can help me?
I have objects, and each object has the same set of features. It can be illustrated in a table:
        Feature1 Feature2 Feature3 Feature4
Object1    1        0        1        1
Object2    0        0        0        1
Object3    0        1        1        1
Object4    0        1        0        0
Now I want to find all minimum subsets of objects such that each subset has at least one value of "1" for every feature. For the table above, the result is two subsets: {Object1, Object3} and {Object1, Object4}.
I cannot simply generate all possible subsets because that would take too much time.

This is exactly the set cover problem. The problem is NP-hard, so if you need the exact minimum, generating all possible subsets is not much worse than other known solutions in terms of worst-case time.
But there are polynomial-time approximation algorithms; see the Wikipedia page for details. The "best" one is the greedy algorithm, which runs like this (sketched in code below):
1. Initialize the unimplemented features as {Feature1, Feature2, Feature3, ...}.
2. Select the object that implements the most unimplemented features.
3. Repeat step 2 until all features are implemented.
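Here is a minimal Python sketch of that greedy loop using the table from the question; the dict-of-sets layout and names are only an illustration, not something prescribed by the question.

def greedy_cover(objects):
    # objects: dict mapping object name -> set of features it implements
    uncovered = set().union(*objects.values())   # all features start unimplemented
    chosen = []
    while uncovered:
        # step 2: pick the object implementing the most unimplemented features
        best = max(objects, key=lambda o: len(objects[o] & uncovered))
        if not objects[best] & uncovered:
            raise ValueError("remaining features cannot be covered")
        chosen.append(best)
        uncovered -= objects[best]
    return chosen

table = {"Object1": {"Feature1", "Feature3", "Feature4"},
         "Object2": {"Feature4"},
         "Object3": {"Feature2", "Feature3", "Feature4"},
         "Object4": {"Feature2"}}
print(greedy_cover(table))   # -> ['Object1', 'Object3']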

You can reduce the problem by including, out of necessity, all objects that are the sole possessor of a given feature (or features). Object1 is the only one that has Feature1, so you know you need it in your solution.
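As a rough sketch of that preprocessing step (same illustrative dict-of-sets layout as above), you can pre-select every object that is the sole owner of some feature before running the search:

def forced_objects(objects):
    # objects: dict mapping object name -> set of features it implements
    forced = set()
    for feature in set().union(*objects.values()):
        owners = [o for o, feats in objects.items() if feature in feats]
        if len(owners) == 1:
            forced.add(owners[0])     # sole possessor of this feature
    return forced

table = {"Object1": {"Feature1", "Feature3", "Feature4"},
         "Object2": {"Feature4"},
         "Object3": {"Feature2", "Feature3", "Feature4"},
         "Object4": {"Feature2"}}
print(forced_objects(table))   # {'Object1'}, the only owner of Feature1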

Related

Find the configuration of on/off switches that maximizes number of light bulbs turned on

I have a collection $C$ of unique binary 1-dimensional vectors of length $n$, so there are $2^n - 1$ elements in this collection, excluding the one with all $0$'s. It's not exactly a power set because all items have the same length, but I used a power set to create it and appended $0$'s in an appropriate way. For example, when $n = 3$,
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
Each element has an associated reward function $R$ where $R(v) \ge 0$ for $v \in C$. Also, it's possible $R(v) \ge n$.
These are my current observations on $R$ (percentages are not exact, but representative):
$R(v) = 0$ like $80$% of the time
$R(v) = 10$ like $19$% of the time
$R(v) > 10$ like $1$% of the time
In short, most of the items in this collection will not provide a useful reward. The rewards are not random, but I don't know how to find out what distribution they follow. What is the best way to maximize this reward without having to go through all $2^n - 1$ elements?
Let me rephrase the question with an example.
I think of $C$ as a collection of on/off switch configurations, and each configuration turns on some number of light bulbs (one switch != one light bulb; for example, 0010 can turn on 0 light bulbs, and 0110 can turn on 5 light bulbs). What's the fastest way to find the maximum number of light bulbs turned on, given that
hardly any of these configurations will turn on a light bulb, and
very few configurations turn on more than 10 light bulbs, but there are a few configurations that can turn on hundreds of light bulbs?
I tried using a variation of Viterbi (using a cumulative sum of rewards instead of probabilities), in which I keep track of how changing a single element will yield a maximal result, but I think an implementation of branch and bound or a decision tree would be better in this case. A genetic algorithm would not work because the distribution of the rewards is too biased towards 0 and 10. And a brute-force approach, one element at a time, would take forever if $n$ is large.
Any suggestions? I will have to code this up, but first I want to get a feel for a solution.

Find Minimum Time to Occupy Grid

Problem:
Consider a patient suffering from a skin infection where germs are spreading rapidly. Assume the skin surface is modelled as a rectangular grid of size MxN whose cells are marked 0 or 1, where 0 represents a non-affected region and 1 represents an affected region. Germs can move from one cell of the grid to another in 4 possible directions (right, left, up, down), but can only move to one cell at a time in one direction, and they infect that cell in 1 second. The doctor treating the patient sees the current status and wants to know how much time is left to save the patient before the germs spread over the whole skin. Can you help estimate the minimum time taken for the germs to completely occupy the skin surface?
Input: Current status of the skin (a matrix of size MxN with 1's and 0's representing affected and non-affected areas).
Output: Minimum time in seconds for the germs to cover the whole grid.
Example:
Input:
[1 1 0 0 1]
[0 1 1 0 0]
[0 0 0 0 1]
[0 1 0 0 0]
Output: 2 seconds
Explanation:
One second after the input, the matrix could be as below
[1 1 1 0 1]
[1 1 1 0 1]
[0 1 1 0 1]
[0 1 1 0 1]
In the next second, the matrix is completely filled with 1's.
I will not present a detailed solution here, but rather some thoughts that hopefully may help you write your own program.
The first step is to determine the kind of algorithm to implement. The optimal way would be to find a simple and fast ad hoc solution to this problem. In the absence of such a solution, the classical candidates for this kind of problem are DFS, BFS, A*, and so on.
As the goal is to find the shortest solution, it seems natural to consider BFS first: once BFS finds a solution, we know it is the shortest one and we can stop the search. However, we then have to avoid an explosion in the number of nodes, as it would lead not only to a huge computation time but also to huge memory usage.
The first idea to avoid node inflation is to notice that some 1-cells can only expand into one other cell. In the posted diagram, for example, the cell (0, 0) (top left) can only expand to cell (1, 0). Then, after this expansion, cell (1, 1) can only expand to cell (2, 1). Therefore, we know it would be suboptimal to expand cell (1, 1) into cell (1, 0). Conclusion: expand such cells first.
In a similar way, once an infected cell is surrounded only by other infected cells, it no longer needs to be considered for further moves.
In the end, it is convenient to keep a list of infected cells, together with the number of non-infected cells each of them can expand into.
Another idea to limit the number of nodes is to detect duplicates, as it is likely that many of them will appear here. For that, we have to define a kind of hashing. The hash function does not need to be perfect, but it needs to be fast to compute, and if possible in an incremental manner. If we obtain diagram B from diagram A by adding a 1-cell at position (i, j), then I propose something like
H(B) = H(A) ^ f(i, j)
f(i, j) = a*(1024*i + j) % b
Here, I used the fact that N and M are less than 1000.
Each time a new diagram is considered, we calculate the corresponding H value and check whether it already exists in the set of past diagrams.
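As a rough, non-authoritative sketch of that hashing idea in Python (the constants a and b below are my own arbitrary picks; Zobrist-style random per-cell keys would work just as well):

A_CONST = 2654435761            # arbitrary large odd multiplier (my choice)
B_CONST = (1 << 61) - 1         # arbitrary large modulus (my choice)

def cell_key(i, j):
    # f(i, j) = a*(1024*i + j) % b, valid while the grid has fewer than 1024 columns
    return (A_CONST * (1024 * i + j)) % B_CONST

def diagram_hash(infected_cells):
    # hash of a whole diagram: XOR of the keys of its infected cells
    h = 0
    for i, j in infected_cells:
        h ^= cell_key(i, j)
    return h

def child_hash(parent_hash, i, j):
    # H(B) = H(A) ^ f(i, j) when diagram B adds the single cell (i, j) to diagram A
    return parent_hash ^ cell_key(i, j)

# usage: keep a set of hashes of diagrams already expanded
start = diagram_hash([(0, 0), (0, 1)])
seen = {start}
h = child_hash(start, 1, 0)
if h not in seen:
    seen.add(h)                 # a diagram not encountered before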
I'm not sure how far I would get with this in an interview situation. After some thought, rather than considering solutions that store more than one full board state, I would consider a greedy priority queue, since a strong heuristic for the next zero-cell candidates to fill seems to be:
(1) healthy cells that have the fewest neighbouring infected cells (but at least one, of course),
e.g., choose A over B
1 1 B 0 1
0 1 1 0 0
0 0 A 0 1
0 1 0 0 0
and (2) break ties by choosing first the healthy cells that, when infected, will block the fewest infected cells (both heuristics are sketched in code further below).
e.g., choose A over B
1 1 1 0 1
1 B 1 0 A
0 0 0 0 1
0 1 0 0 0
An interesting observation is that any healthy destination cell can technically be reached in a time equal to its Manhattan distance from the nearest infected cell, where the cell leading such a "crawl" continually chooses the single move that brings it closer to the destination. We know, though, that at the same time this same infected-cell "snake" produces new "crawlers" that could reach any equally far or closer neighbours. This makes me wonder whether there may be a more efficient way to determine the lower bound, based on counts of the farthest healthy cells.
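To make heuristics (1) and (2) concrete, here is a rough Python sketch of how one candidate cell could be picked. "Blocking" is interpreted here as leaving an infected neighbour with no healthy neighbours, which matches the second example above but is my own reading.

def neighbours(grid, i, j):
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < len(grid) and 0 <= nj < len(grid[0]):
            yield ni, nj

def next_cell_to_infect(grid):
    best, best_key = None, None
    for i in range(len(grid)):
        for j in range(len(grid[0])):
            if grid[i][j] == 1:
                continue
            infected = [(a, b) for a, b in neighbours(grid, i, j) if grid[a][b] == 1]
            if not infected:
                continue                              # cannot be reached this second
            # (2) infected neighbours that would end up fully surrounded
            blocked = sum(
                all((a, b) == (i, j) or grid[a][b] == 1
                    for a, b in neighbours(grid, x, y))
                for x, y in infected)
            key = (len(infected), blocked)            # heuristic (1), then (2)
            if best_key is None or key < best_key:
                best, best_key = (i, j), key
    return best

grid = [[1, 1, 0, 0, 1],
        [0, 1, 1, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 1, 0, 0, 0]]
print(next_cell_to_infect(grid))    # one of the least-constrained healthy cells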
This is a variant of the multi-agent pathfinding problem (MAPF). There is a ton of recent work on this topic, but the earlier modern work is a good starting point for finding optimal solutions to this problem - for instance, the operator decomposition approach.
To do this you would order the agents (germs) 1..k. Then you would start a search where you generate all possible first moves for germ 1, followed by all possible first moves for germ 2, and so on, where the moves for an agent are to stay in place or to spread to an adjacent unoccupied location. With 4 possible actions for each germ, there are up to 4^k possible actions between complete states. (Partial states occur when you haven't yet assigned actions to all k agents.) The number of actions is exponential, meaning you may run up against resource constraints (time or space) fairly quickly. But there are only 2^(MxN) states possible. (Since germs don't go away, it's actually 2^(MxN-i), where i is the number of initial germs.)
Every time all k germs have been assigned a possible action, you have a new complete state. (And k then increases for the next iteration.) The minimum time left comes from the shallowest complete state in which the grid is completely filled. A bit of brute-force computation will find the shortest solution. (Quite a bit, in the case of large grids.)
You could use BFS to find the first state that is completely filled, but A* might do much better. As a heuristic, you could pretend that every adjacent location of every infected cell gets filled in each step, and then compute the number of steps required to fill the grid under that model. That gives a lower bound on the time required to fill the full grid.
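A small sketch of that relaxed model in Python: if every infected cell could spread to all of its healthy neighbours every second, the time needed is just the largest multi-source BFS distance from the initially infected cells, which gives an admissible A* heuristic.

from collections import deque

def fill_time_lower_bound(grid):
    rows, cols = len(grid), len(grid[0])
    dist = [[None] * cols for _ in range(rows)]
    queue = deque((i, j) for i in range(rows) for j in range(cols) if grid[i][j] == 1)
    for i, j in queue:
        dist[i][j] = 0
    while queue:
        i, j = queue.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols and dist[ni][nj] is None:
                dist[ni][nj] = dist[i][j] + 1
                queue.append((ni, nj))
    # every healthy cell is reachable on a 4-connected grid with at least one germ
    return max(d for row in dist for d in row)

grid = [[1, 1, 0, 0, 1],
        [0, 1, 1, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 1, 0, 0, 0]]
print(fill_time_lower_bound(grid))   # 2 for the grid in the question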
But there are many more optimizations. The reason to use operator decomposition is that you can order the moves so as to take the best moves first and avoid considering the weaker possibilities (e.g., not all germs spreading). You could also use a partial-expansion approach (EPEA*) to avoid generating a lot of clearly suboptimal policies for the germs.
If I were asking this as an interview question, I might be looking to see someone formulate the problem (what the actions and states are), come up with the lower bound on the solution (every germ expands to every adjacent cell), come up with an algorithm, and perhaps analyze how hard the problem is, in order of increasing difficulty.

Distributing limited resources with choices

I'm looking for an algorithm to figure out the maximum number of "products" that can be created given limited "resource choices".
For example, let's say I want to create an ABC and am given these resource choices:
A or B: 3
A or C: 1
B: 1
C: 1
In this case, I can create at most 2 ABC by selecting 2 A and 1 B from the first choice and 1 C from the second choice, plus the 1 B and 1 C from the last two choices. That gives a total of 2 A, 2 B, and 2 C, which makes 2 ABC.
Is there an algorithm, other than brute forcing the permutations, which solves this problem?
For completeness, here are the constraints in my actual problem:
Max number of different resources in product: 10
Max number of resource choices: 20
Max quantity of a choice: 50
This can be solved using linear programming, or by using the Ford-Fulkerson algorithm to solve it as a maximum network flow problem. It would take me quite some time to explain either of the two algorithms, so I think you should take a look at some online resources. If you need to solve a real problem, I suggest you go through the algorithm(s) just to get an idea of how to model this particular problem, and then use an existing library. If you want to learn the algorithms, feel free to write your own implementation :)
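Here is a rough sketch of one way to model the example as a max-flow feasibility test in Python, using networkx (my choice of library; any max-flow or LP solver would do). It assumes the product needs exactly one of each listed resource, as in the example, and binary-searches on the number of products k.

import networkx as nx

choices = [({"A", "B"}, 3), ({"A", "C"}, 1), ({"B"}, 1), ({"C"}, 1)]
resources = {"A", "B", "C"}                   # one of each needed per product

def can_build(k):
    g = nx.DiGraph()
    for idx, (options, qty) in enumerate(choices):
        g.add_edge("source", ("choice", idx), capacity=qty)
        for r in options:
            g.add_edge(("choice", idx), r, capacity=qty)
    for r in resources:
        g.add_edge(r, "sink", capacity=k)     # k products need k units of each resource
    flow, _ = nx.maximum_flow(g, "source", "sink")
    return flow == k * len(resources)

lo, hi = 0, sum(qty for _, qty in choices)    # no more products than total resources
while lo < hi:                                # binary search for the largest feasible k
    mid = (lo + hi + 1) // 2
    if can_build(mid):
        lo = mid
    else:
        hi = mid - 1
print(lo)                                     # 2 for the example above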

Combinations of binary features (vectors)

The source data for this problem is an m-by-n binary matrix (only 0s and 1s are allowed).
The m rows represent observations, the n columns represent features. Some observations are marked as targets, which need to be separated from the rest.
While it looks like a typical NN, SVM, etc. problem, I don't need generalization. What I need is an efficient algorithm to find as many combinations of columns (features) as possible that completely separate the targets from the other observations, that is, classify them.
For example:
f1 f2 f3
o1 1 1 0
t1 1 0 1
o2 0 1 1
Here {f1, f3} is an acceptable combo that separates target t1 from the rest (o1, o2) (by the way, {f2} is NOT, since by the task definition a feature MUST be present in a target). In other words,
t1(f1) & t1(f3) = 1 and o1(f1) & o1(f3) = 0, o2(f1) & o2(f3) = 0
where '&' represents logical conjunction (AND).
Here m is about 100,000 and n is 1,000. Currently the data is packed into 128-bit words along m and the search is optimized with SSE4 and whatnot. Yet it takes way too long to obtain those feature combos.
After 2 billion calls to the tree-descent routine it has covered about 15% of the root nodes, and it has found about 8,000 combos, which is a decent result for my particular application.
I use some empirical criteria to cut off less promising descent paths, with some limited success, but is there something radically better? I'm pretty sure there has to be. Any help, in whatever form, reference or suggestion, would be appreciated.
I believe the problem you describe is NP-hard, so you shouldn't expect to find the optimum solution in a reasonable time. I do not understand your current algorithm, but here are some suggestions off the top of my head:
1) Construct a decision tree. Label targets as A and non-targets as B and let the decision tree learn the categorization. At each node select the feature such that a function of P(target | feature) and P(target' | feature') is maximized (i.e., as many targets as possible fall on the positive side and as many non-targets as possible fall on the negative side).
2) Use a greedy algorithm. Start from the empty set and at each step add the feature that kills the most non-target rows (see the sketch below).
3) Use a randomized algorithm. Start from a small subset of the positive features of some target and use that set as the seed for the greedy algorithm. Repeat many times and pick the best solution. The greedy algorithm will be fast, so this is fine.
4) Use a genetic algorithm. Generate random seeds for the greedy algorithm as in 3) to generate good solutions, and cross them (bitwise AND, probably) to generate new candidate seeds. Remember the best solution. Keep the good solutions as the current population. Repeat for many generations.
You will need to answer the query "how many of the given rows have the given feature f" quickly, so you'll probably need specialized data structures, perhaps a bit array for each feature.
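As a rough sketch of suggestion 2), using plain Python integers as the per-feature bit arrays (the layout and names are mine, not from the question):

def greedy_separating_features(columns, target_row, n_rows):
    # columns[f] has bit r set iff row r has feature f
    alive = ((1 << n_rows) - 1) & ~(1 << target_row)   # non-target rows not yet separated
    candidates = [f for f, col in enumerate(columns)
                  if (col >> target_row) & 1]           # a feature must be 1 in the target
    chosen = []
    while alive:
        # add the feature that kills the most remaining non-target rows
        f = max(candidates, key=lambda f: bin(alive & ~columns[f]).count("1"))
        if (alive & ~columns[f]) == 0:
            return None                                 # stuck: no feature removes any row
        chosen.append(f)
        alive &= columns[f]                             # rows lacking f are killed
    return chosen

# the 3x3 example from the question: rows o1, t1, o2 and columns f1, f2, f3
cols = [0b011, 0b101, 0b110]          # bit 0 = o1, bit 1 = t1, bit 2 = o2
print(greedy_separating_features(cols, target_row=1, n_rows=3))   # [0, 2], i.e. {f1, f3}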

Algorithm design to assign nodes to graphs

I have a graph-theoretic problem (also related to combinatorics) that is illustrated below, and I wonder what the best approach is to design an algorithm for it.
Given 4 different graphs of 6 nodes each (by different, I mean different structures, e.g. star, line, complete, etc.) and 24 unique objects, design an algorithm to assign these objects to these 4 graphs 4 times, so that the number of repeated neighbors on the graphs over the 4 assignments is minimized. For example, if objects A and B are neighbors on 1 of the 4 graphs in one assignment, then in the best case they will not be neighbors again in the other 3 assignments.
Obviously, how far such minimization can go depends on the specific graph structures given. But I am more interested in a general solution here, so that given any 4 graph structures, such minimization is guaranteed as the result of the algorithm.
Any suggestion/idea of solving this problem is welcome, and some pseudo-code may well be sufficient to illustrate the design. Thank you.
Representation:
You have 24 elements; I will name these elements A to X (the first 24 letters).
Each of these elements will be placed on a node in one of the 4 graphs. I will number the 24 nodes of the 4 graphs from 1 to 24.
I will identify the position of A by a 24-tuple (xa1, xa2, ..., xa24); if I want to assign A to node number 8, for example, I will write (xa1, xa2, ..., xa24) = (0,0,0,0,0,0,0,1,0,0,...,0), where the 1 is in position 8.
So we can say that A = (xa1, ..., xa24).
e1, ..., e24 are the unit vectors (1,0,...,0) to (0,0,...,1).
A note about the operator '.':
A.e1 = xa1
...
X.e24 = xx24
With these notations, there are some constraints on A, ..., X:
each xai is in {0,1}
and
Sum(xa1,...,xa24) = 1 ... Sum(xx1,...,xx24) = 1
Sum(xa1, xb1, ..., xx1) = 1 ... Sum(xa24, xb24, ..., xx24) = 1
since each element can be assigned to only one node, and each node receives exactly one element.
I will define a graph by the neighbor relation of each node; let's say node 8 has neighbors node 7 and node 10.
To check that A and B are neighbors at node 8, for example, I need:
A.e8 = 1 and (B.e7 = 1 or B.e10 = 1), which means I just need to test A.e8*(B.e7 + B.e10) == 1.
In the function IsNeighborInGraphs(A, B) I test this for every node and get one or zero depending on the neighborhood.
Notations:
4 graphs of 6 nodes each; the position of each element is given by an integer from 1 to 24
(1 to 6 for the first graph, 7 to 12 for the second, and so on).
e1, ..., e24 are the unit vectors (1,0,0,...,0) to (0,0,...,1).
Let A, B, ..., X be the 24 elements:
A = (0,0,...,1,...,0) = (xa1, xa2, ..., xa24)
B = ...
...
X = (0,0,...,1,...,0)
Graph description:
IsNeighborInGraphs(A,B) = A.e1*B.e2 + ...
// one term for each pair of nodes, such as 1 and 2, that are neighbors in one of the graphs
State of the system:
L(A) = [B, B, C, E, G, ...] // list of neighbors A has had so far (elements can repeat)
actualise(L(A)):
  for Element in [B..X]
    if IsNeighborInGraphs(A, Element)
      L(A).append(Element)
    endIf
  endFor
Objective functions:
N(A) = len(L(A)) + Sum(IsNeighborInGraphs(A, i), i in L(A))
...
N(X) = ...
Description of the algorithm:
1. Start with an initial position A = e1, ..., X = e24.
2. Update L(A), L(B), ..., L(X).
3. Solve the following (with a solver; AMPL, for example, should work, since it is a nonlinear optimization problem):
   Objective function:
   min(Sum(N(Z), Z = A to X))
   Constraints:
   Sum(xa1,...,xa24) = 1 ... Sum(xx1,...,xx24) = 1
   Sum(xa1, xb1, ..., xx1) = 1 ... Sum(xa24, xb24, ..., xx24) = 1
   This gives the best assignment for this round.
4. Repeat steps 2 and 3 three more times.
If all four graphs are K_6, then the best you can do is choose 4 set partitions of your 24 objects, each into 4 sets of cardinality 6, so that the pairwise intersection of any two sets from different partitions has cardinality at most 2. You can do this by choosing set partitions that are maximally far apart in the Hasse diagram of set partitions, with the partial order given by refinement. The general case is much harder, but perhaps you can still begin with this crude approximation of a solution and then be clever about which vertex is assigned which object in the four assignments.
Assuming you don't want to cycle through all combinations, computing the sum every time and choosing the lowest, you can formulate a minimization problem (solved, depending on your constraints, with either a linear programming solver, i.e. a simplex-algorithm engine, or a nonlinear solver, which is much harder in terms of running time) with constraints on your 24 variables depending on the shape of your graphs. You can also use free software like LINGO/LINDO to rapidly create a decision-theory model and test its correctness (you need some decision-theory notions, though).
If this has anything to do with the real world, then it's unlikely that you absolutely must have the true minimum; close to the minimum should be good enough, right? If so, you could repeatedly make the 4 assignments at random and check the results, until you either run out of time, have a good-enough solution, or appear to have stopped improving your best solution.
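A rough sketch of that random-restart loop in Python; it assumes the 24 node slots are numbered 0..23 as in the earlier answer (slots 0-5 for the first graph, and so on) and that the four graphs are given as one combined edge list over those slots.

import random

def repeated_pairs(assignments, edges):
    # assignments: 4 lists, each mapping slot index 0..23 -> object id
    # counts how often an object pair is adjacent more than once overall
    seen, repeats = set(), 0
    for placement in assignments:
        for u, v in edges:
            pair = frozenset((placement[u], placement[v]))
            if pair in seen:
                repeats += 1
            seen.add(pair)
    return repeats

def random_search(edges, tries=10000):
    objects = list(range(24))
    best, best_score = None, None
    for _ in range(tries):
        assignments = []
        for _ in range(4):
            random.shuffle(objects)
            assignments.append(objects[:])     # one random placement of all 24 objects
        score = repeated_pairs(assignments, edges)
        if best_score is None or score < best_score:
            best, best_score = assignments, score
    return best, best_score

# edges = [(0, 1), (1, 2), ...]   # union of the four graphs' edges over slots 0..23
# best, score = random_search(edges)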
