Largest subset of lines with two unique columns


Given a text file with two columns, produce the largest possible subset of lines for which no value is repeated within either column.
For example, given these four lines:
1 a
1 b
2 a
2 b
One can use something like "sort -u" on the command line, to unique first on column 1, leaving
1 a
2 a
and then on column two, leaving just
1 a
This satisfies "no value is repeated" but not "largest possible subset".
In an ideal world, I would have produced either
1 a
2 b
or
1 b
2 a
Given the further constraint that these files might be many gigabytes (i.e. much larger than available RAM, but much smaller than available disk), I can't just keep all the values in a data structure.
Can anyone think of an approach?
I would also be happy with "a pretty large subset", if I can't literally get "the largest possible subset"
If I sort by (column 1 ascending and then column 2 random), uniq'ing on column 1 will give me slightly better results, but I feel like there's something simple that I'm missing.

For each unique item in column 1, build a list of the unique column 2 items it is paired with. Then, starting with the shortest list, build the output by taking, for each column 1 item, the first value in its list that has not been used in the output yet.
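The greedy idea above can be sketched in Python as follows. This is a minimal in-memory sketch, so it assumes the distinct pairs fit in RAM, which the question rules out for the full files; the function name `greedy_unique_pairs` is made up for illustration. It is only a heuristic: finding the truly largest subset is a maximum bipartite matching problem, and this greedy pass can fall short of it.

```python
from collections import defaultdict


def greedy_unique_pairs(pairs):
    """Pick pairs so that no column-1 or column-2 value repeats (greedy heuristic)."""
    # For each column-1 value, remember its distinct column-2 partners in input order.
    options = defaultdict(dict)
    for left, right in pairs:
        options[left][right] = None

    used_right = set()
    result = []
    # Serve the column-1 values with the fewest partners first.
    for left in sorted(options, key=lambda key: len(options[key])):
        for right in options[left]:
            if right not in used_right:
                used_right.add(right)
                result.append((left, right))
                break
    return result


lines = ["1 a", "1 b", "2 a", "2 b"]
chosen = greedy_unique_pairs(tuple(line.split()) for line in lines)
print(chosen)
```

On the four example lines this keeps two lines, one per column-1 value. For inputs such as 1:{a,b}, 2:{b,c}, 3:{a,b} the same pass keeps only two lines although three are possible, which is why it counts as "a pretty large subset" rather than the largest.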


Pseudocode of sorting a list of strings without using loops

I was trying to think of an algorithm that would sort a list of strings according to its first 4 chars (say each line from a file), without using the conventional looping methods such as while,for. An example of inputs would be:
1231COME1900123
1233COME1902030
2031COME1923919
1231GO 1231203
1233GO 1932911
2031GO 1239391
We do not know the number of records beforehand, and each 4-digit ID number can have multiple COME and GO records. The input arrives ordered as shown above, and I want to sort the file by the 4-digit ID number to achieve this:
1231COME1900123
1231GO 1231203
1233COME1902030
1233GO 1932911
2031COME1923919
2031GO 1239391
The only logical comment I can have is that we should be using a recursive way to read through the records, but the sorting part is a bit tricky for me. Also GOTO could be used as well. Any ideas?

Assuming that the first 4 characters of each entry are always digits, you can do something like the following:
1. Create a list of length 10000, where each element can hold a pair of values.
2. Enter into that element of the list based upon the first 4 digits.
3. The shape of the individual elements will be as follows -> [COME_ELEMENT, GO_ELEMENT].
4. Each COME_ELEMENT and GO_ELEMENT is a list in itself, of length equal to the maximum value + 1 that can appear after the words COME & GO.
5. Now, as each string arrives, split off its first 4 digits and go to that element of the list.
6. After that, check whether it's a GO or a COME.
7. If it's a GO (suppose), then determine the number after the word GO.
8. Insert the string at the index (determined in the 7th step) in the inner list.
9. When you're done inserting values, just traverse the non-empty elements.
The result so obtained will be in the sorted order that you require, without the use of looping.
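A loop-free sort can also be written with plain recursion. The sketch below is a recursive insertion sort on the first 4 characters, not the lookup-table scheme described above; the names `insert_by_id` and `sort_by_id` are hypothetical. It keeps records with equal IDs in their input order, and since it recurses once per record it only suits small inputs (Python's default recursion limit is roughly 1000 frames).

```python
def insert_by_id(sorted_records, record):
    """Insert record after any records whose 4-char ID is <= its own (no loops)."""
    if not sorted_records or record[:4] < sorted_records[0][:4]:
        return [record] + sorted_records
    return [sorted_records[0]] + insert_by_id(sorted_records[1:], record)


def sort_by_id(records, done=()):
    """Recursively move records one at a time into a list ordered by ID."""
    if not records:
        return list(done)
    return sort_by_id(records[1:], insert_by_id(list(done), records[0]))


records = [
    "1231COME1900123",
    "1233COME1902030",
    "2031COME1923919",
    "1231GO 1231203",
    "1233GO 1932911",
    "2031GO 1239391",
]
print("\n".join(sort_by_id(records)))
```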

What is the good approach in solving this programming challenge?

In one programming contest, this problem was given.
A database contains a table with two columns.
First is the id of the member,
Second can be
0(if he doesn't have any sub-ordinates),
id(if only one sub-ordinate),
sum of id's(if he has two sub-ordinates)
//Max Two assistants only.
We need to find the head of the gang
Example Input:
The first line indicates 'n' [the number of records,3<n<100]
the next four are the actual records
4
1 7
2 1
3 0
4 0
Here 3 and 4 have 0 in their second columns, which means they don't have any sub-ordinates.
1 has 7 in the second column, which is not the id of any member, so it must be the sum of two ids [so 3 and 4 are the sub-ordinates of 1].
2 has 1 as its sub-ordinate,
so 2 is the head of the gang.
Output:
2
I am unable to solve the problem.
Can anyone help me?
If this is not the right place to ask this type of question,
can you suggest some websites where I can post these types of questions?

I will give you a hint (which is almost a solution) here:
What is the sum of all the numbers in the second column?
Answer (spoiler alert):
The id of the head of the gang (if one exists) is: 1 + 2 + ... + n - (the sum of all the numbers in the second column). Every member who is somebody's sub-ordinate contributes their id to the second column, so the above number actually gives the sum of the ids of all top-level members (i.e. members who are nobody's sub-ordinate). Thus the correctness relies on the assumption that there is exactly one head of the gang.
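As a quick check of the hint, here is a small Python sketch applied to the example input. The function name `find_head` is invented for illustration, and it sums the first column directly instead of using 1 + 2 + ... + n, which gives the same value when the ids are exactly 1..n. It assumes a single head and that every member has at most one superior.

```python
def find_head(records):
    """records: list of (member_id, second_column) pairs from the table."""
    total_ids = sum(member_id for member_id, _ in records)
    total_subordinates = sum(second for _, second in records)
    # Every sub-ordinate's id appears once in the second column,
    # so only the head's id survives the subtraction.
    return total_ids - total_subordinates


example = [(1, 7), (2, 1), (3, 0), (4, 0)]
print(find_head(example))  # prints 2
```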

Possible NxN matrices, t 1's in each row and column, none in diagonal?

Background:
This is extra credit in a logic and algorithms class. We are currently covering propositional logic, P implies Q, that kind of thing, so I think the Prof wanted to give us an assignment out of our depth.
I will implement this in C++, but right now I just want to understand what's going on in the example... which I don't.
Example
Enclosed is a walkthrough for the Lefty algorithm which computes the number
of nxn 0-1 matrices with t ones in each row and column, but none on the main
diagonal.
The algorithm used to verify the equations presented counts all the possible
matrices, but does not construct them.
It is called "Lefty", it is reasonably simple, and is best described with an
example.
Suppose we wanted to compute the number of 6x6 0-1 matrices with 2 ones
in each row and column, but no ones on the main diagonal. We first create a
state vector of length 6, filled with 2s:
(2 2 2 2 2 2)
This state vector symbolizes the number of ones we must yet place in each
column. We accompany it with an integer which we call the "puck", which is
initialized to 1. This puck will increase by one each time we perform a ones
placement in a row of the matrix (a "round"), and we will think of the puck as
"covering up" the column that we won't be able to place ones in for that round.
Since we are starting with the first row (and hence the first round), we place
two ones in any column, but since the puck is 1, we cannot place ones in the
first column. This corresponds to the forced zero that we must place in the first
column, since the 1,1 entry is part of the matrix's main diagonal.
The algorithm will iterate over all possible choices, but to show each round,
we shall make a choice, say the 2nd and 6th columns. We then drop the state
vector by subtracting 1 from the 2nd and 6th values, and advance the puck:
(2 1 2 2 2 1); 2
For the second round, the puck is 2, so we cannot place a one in that column.
We choose to place ones in the 4th and 6th columns instead and advance the
puck:
(2 1 2 1 2 0); 3
Now at this point, we can place two ones anywhere but the 3rd and 6th
columns. At this stage the algorithm treats the possibilities differently: We
can place some ones before the puck (in the column indexes less than the puck
value), and/or some ones after the puck (in the column indexes greater than
the puck value). Before the puck, we can place a one where there is a 1, or
where there is a 2; after the puck, we can place a one in the 4th or 5th columns.
Suppose we place ones in the 4th and 5th columns. We drop the state vector
and advance the puck once more:
(2 1 2 0 1 0); 4
For the 4th round, we once again notice we can place some ones before the
puck, and/or some ones after.
Before the puck, we can place:
(a) two ones in columns of value 2 (1 choice)
(b) one one in the column of value 2 (2 choices)
(c) one one in the column of value 1 (1 choice)
(d) one one in a column of value 2 and one one in a column of value 1 (2
choices).
After we choose one of the options (a)-(d), we must multiply the listed
number of choices by one for each way to place any remaining ones to the right
of the puck.
So, for option (a), there is only one way to place the ones.
For option (b), there are two possible ways for each possible placement of
the remaining one to the right of the puck. Since there is only one nonzero value
remaining to the right of the puck, there are two ways total.
For option (c), there is one possible way for each possible placement of the
remaining one to the right of the puck. Again, since there is only one nonzero
value remaining, there is one way total.
For option (d), there are two possible ways to place the ones.
We choose option (a). We drop the state vector and advance the puck:
(1 1 1 0 1 0); 5
Since the puck is "covering" the 1 in the 5th column, we can only place
ones before the puck. There are (3 take 2) ways to place two ones in the three
columns of value 1, so we multiply 3 by the number of ways to get remaining
possibilities. After choosing the 1st and 3rd columns (though it doesn't matter
since we're left of the puck; any two of the three will do), we drop the state
vector and advance the puck one final time:
(0 1 0 0 1 0); 6
There is only one way to place the ones in this situation, so we terminate
with a count of 1. But we must take into account all the multiplications along
the way: 1*1*1*1*3*1 = 3.
Another way of thinking of the varying row is to start with the first matrix,
focus on the lower-left 2x3 submatrix, and note how many ways there were to
permute the columns of that submatrix. Since there are only 3 such ways, we
get 3 matrices.
What I think I understand
This algorithm counts all possible 6x6 arrays with two 1's in each row and column and none on the descending diagonal.
Instead of constructing the matrices, it uses a "state_vector" filled with six 2's, representing how many 1's still have to be placed in each column, and a "puck" that represents the index of the diagonal and the current row as the algorithm iterates.
What I don't understand
The algorithm comes up with a value of 1 for each row except 5 which is assigned a 3, at the end these values are multiplied for the end result. These values are supposed to be the possible placements for each row but there are many possibilities for row 1, why was it given a one, why did the algorithm wait until row 5 to figure all the possible permutations?
Any help will be much appreciated!

I think what is going on is a tradeoff between doing combinatorics and doing recursion.
The algorithm is using recursion to add up all the counts for each choice of placing the 1's. The example considers a single choice at each stage, but to get the full count it needs to add the results for all possible choices.
Now it is quite possible to get the final answer simply using recursion all the way down. Every time we reach the bottom we just add 1 to the total count.
The normal next step is to cache the result of calling the recursive function as this greatly improves the speed. However, the memory use for such a dynamic programming approach depends on the number of states that need to be expanded.
The combinatorics in the later stages is making use of the fact that once the puck has passed a column, the exact arrangement of counts in the columns doesn't matter so you only need to evaluate one representative of each type and then add up the resulting counts multiplied by the number of equivalent ways.
This both reduces the memory use and improves the speed of the algorithm.
Note that you cannot use combinatorics for counts to the right of the puck, as for these the order of the counts is still important due to the restriction about the diagonal.
P.S. You can actually compute the number of n*n matrices with two 1's in each row and column (and no diagonal entries) with pure combinatorics as:
a(n) = Sum_{k=0..n} Sum_{s=0..k} Sum_{j=0..n-k} (-1)^(k+j-s)*n!*(n-k)!*(2n-k-2j-s)!/(s!*(k-s)!*(n-k-j)!^2*j!*2^(2n-2k-j))
According to OEIS.
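To make the "recursion all the way down, then cache" idea concrete, here is a minimal Python sketch. It is not the Lefty algorithm itself: it does not use the combinatorial shortcut for columns left of the puck, it simply tries every placement row by row and memoizes on (row, state vector). The name `count_matrices` and its parameters are assumptions made for this illustration.

```python
from functools import lru_cache
from itertools import combinations


def count_matrices(n, t):
    """Count n x n 0-1 matrices with t ones per row and column and a zero diagonal."""

    @lru_cache(maxsize=None)
    def count(row, state):
        # state[c] = number of ones still to be placed in column c
        if row == n:
            return 1 if all(left == 0 for left in state) else 0
        # The "puck" covers column `row`; exhausted columns are also unavailable.
        allowed = [c for c in range(n) if c != row and state[c] > 0]
        total = 0
        for cols in combinations(allowed, t):
            new_state = list(state)
            for c in cols:
                new_state[c] -= 1
            total += count(row + 1, tuple(new_state))
        return total

    return count(0, (t,) * n)


print(count_matrices(6, 2))
```

Each call sums over all choices for the current row, which is the "add the results for all possible choices" step; the walkthrough in the question follows just one of those branches, which is why most rounds show a factor of 1.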

Algorithm X to Solve the Exact Cover: Fat Matrices

As I was reading about Knuth's Algorithm X to solve the exact cover problem, I thought of an edge case that I wanted some clarification on.
Here are my assumptions:
Given a matrix A, Algorithm X's "goal is to select a subset of the rows so that the digit 1 appears in each column exactly once."
If the matrix is empty, the algorithm terminates successfully and the solution is then the subset of rows logged in the partial solution up to that point.
If there is a column of 0's, the algorithm terminates unsuccessfully.
For reference: http://en.wikipedia.org/wiki/Algorithm_X
Consider the matrix A:
[[1 1 0]
[0 1 1]]
Steps I took:
Given Matrix A:
1. Choose a column, c, with the least number of 1's. I choose: column 1
2. Choose a row, r, that contains a 1 in column c. I choose: row 1
3. Add r to the partial solution.
4. For each column j such that A(r, j) = 1:
     For each row i such that A(i, j) = 1:
       delete row i
     delete column j
5. Matrix A is empty. Algorithm terminates successfully and solution is allegedly: {row 1}.
However, this is clearly not the case as row 1 only consists of [1 1 0] and clearly does not cover the 3rd column.
I would assume that the algorithm should at some point reduce the matrix to the point where there is only a single 0 and terminate unsuccessfully.
Could someone please explain this?

I think the confusion here is simply in the use of the term empty matrix. If you read Knuth's original paper (linked on the Wikipedia article you cited), you can see that he was treating the rows and columns as doubly-linked lists. When he says that the matrix is empty, he doesn't mean that it has no entries, he means that all the row and column objects have been deleted.
To clarify, I'll label the rows with lower case letters and the columns with upper case letters, as follows:
| A | B | C
---------------
a | 1 | 1 | 0
---------------
b | 0 | 1 | 1
The algorithm states that you choose a column deterministically (using any rule you wish), and he suggests choosing a column with the fewest number of 1's. So, we'll proceed as you suggest and choose column A. The only row with a 1 in column A is row a, so we choose row a and add it to the possible solution { a }. Now, row a has 1s in columns A and B, so we must delete those columns, and any rows containing 1s in those columns, that is, rows a and b, just as you did. The resulting matrix has a single column C and no rows:
| C
-------
This is not an empty matrix (it has a column remaining). However, column C has no 1s in it, so we terminate unsuccessfully, as the algorithm indicates.
This may seem odd, but it is a very important case if we intend to use an incidence matrix for the Exact Cover Problem, because columns represent elements of the set X that we wish to cover and rows represent subsets of X. So a matrix with some columns and no rows represents the exact cover problem where the collection of subsets to choose from is empty (but there are still points to cover).
If this description causes problems for your implementation, there is a simple workaround: just include the empty set in every problem. The empty set (containing no points of X) is represented by a row of all zeros. It is never selected by your algorithm as part of a solution, never collides with any other selected rows, but always ensures that the matrix is nonempty (there is at least one row) until all the columns have been deleted, which is really all you care about since you need to make sure that each column is covered by some row.
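The failure case above can be checked with a small Python sketch. It is a plain recursive, set-based version of the same select-and-delete steps, not Knuth's linked-list implementation; `exact_covers` and its argument names are invented here. The search succeeds only when no columns remain, so a leftover column with no rows simply yields nothing.

```python
def exact_covers(columns, rows, partial=()):
    """Yield every exact cover as a list of row labels.

    columns: set of column labels still to be covered
    rows: dict mapping row label -> frozenset of columns containing a 1
    """
    if not columns:
        # "Empty matrix" in Knuth's sense: every column has been deleted.
        yield list(partial)
        return
    # Choose a column with the fewest 1s among the remaining rows.
    c = min(columns, key=lambda col: sum(1 for cols in rows.values() if col in cols))
    candidates = [r for r, cols in rows.items() if c in cols]
    for r in candidates:  # no candidates -> this branch terminates unsuccessfully
        covered = rows[r]
        # Delete every row that has a 1 in any column covered by r (including r).
        remaining = {k: v for k, v in rows.items() if not (v & covered)}
        yield from exact_covers(columns - covered, remaining, partial + (r,))


matrix = {"a": frozenset("AB"), "b": frozenset("BC")}
print(list(exact_covers({"A", "B", "C"}, matrix)))  # prints []
```

With the matrix from the question, choosing row a removes rows a and b and columns A and B, leaving column C with no rows, so no solution is reported.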

Solving ACM ICPC - SEERC 2009

I have been sitting on this for almost a week now. Here is the question in a PDF format.
I could only think of one idea so far but it failed. The idea was to recursively create all connected subgraphs which works in O(num_of_connected_subgraphs), but that is way too slow.
I would really appreciate someone giving me a direction. I'm inclined to think that the only way is dynamic programming but I can't seem to figure out how to do it.

OK, here is a conceptual description for the algorithm that I came up with:
Form an array of the (x,y) board map from -7 to 7 in both dimensions and place the opponent's pieces on it.
Starting with the first row (lowest Y value, -N):
  enumerate all possible combinations of the 2nd player's pieces on the row, eliminating only those that conflict with the opponent's pieces.
  for each combination on this row:
    --group connected pieces into separate networks and number these
      networks starting with 1, ascending
    --encode the row as a vector using:
        = 0 for any unoccupied or opponent position
        = (1-8) for the network group that that piece/position is in.
    --give each such grouping a COUNT of 1, and add it to a dictionary/hashset using the encoded vector as its key
Now, for each succeeding row, in ascending order {y=y+1}:
  For every entry in the previous row's dictionary:
    --If the entry has exactly 1 group, add its COUNT to TOTAL
    --enumerate all possible combinations of the 2nd player's pieces
      on the current row, eliminating only those that conflict with the
      opponent's pieces. (change:) you should skip the initial combination
      (where all entries are zero) for this step, as the step above actually
      covers it. For each such combination on the current row:
      + produce a grouping vector as described above
      + compare the current row's group-vector to the previous row's
        group-vector from the dictionary:
        ++ if there are any group-*numbers* from the previous row's
           vector that are not adjacent to any groups in the current
           row's vector, *for at least one value of X*, then skip
           to the next combination.
        ++ any groups for the current row that are adjacent to any
           groups of the previous row, acquire the lowest such group
           number
        ++ any groups for the current row that are not adjacent to
           any groups of the previous row, are assigned an unused
           group number
      + Re-Normalize the group-number assignments for the current-row's
        combination (**) and encode the vector, giving it a COUNT equal
        to the previous row-vector's COUNT
      + Add the current-row's vector to the dictionary for the current
        Row, using its encoded vector as the key. If it already exists,
        then add its COUNT to the COUNT for the pre-existing entry
Finally, for every entry in the dictionary for the last row:
  If the entry has exactly one group, then add its COUNT to TOTAL
**: Re-Normalizing simply means to re-assign the group numbers so as to eliminate any permutations in the grouping pattern. Specifically, this means that new group numbers should be assigned in increasing order, from left-to-right, starting from one. So for example, if your grouping vector looked like this after grouping it to the previous row:
2 0 5 5 0 3 0 5 0 7 ...
it should be re-mapped to this normal form:
1 0 2 2 0 3 0 2 0 4 ...
Note that as in this example, after the first row, the groupings can be discontiguous. This relationship must be preserved, so the two groups of "5"s are re-mapped to the same number ("2") in the re-normalization.
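The re-normalization step is small enough to show directly. Below is a minimal Python sketch, with `renormalize` as a made-up name, that renumbers groups in order of first appearance from left to right and leaves zeros alone.

```python
def renormalize(groups):
    """Renumber non-zero group ids as 1, 2, 3, ... in order of first appearance."""
    mapping = {}
    result = []
    for g in groups:
        if g == 0:
            result.append(0)  # unoccupied or opponent position stays 0
        else:
            if g not in mapping:
                mapping[g] = len(mapping) + 1
            result.append(mapping[g])
    return result


print(renormalize([2, 0, 5, 5, 0, 3, 0, 5, 0, 7]))
# prints [1, 0, 2, 2, 0, 3, 0, 2, 0, 4]
```

Converting the result to a tuple makes it usable as the dictionary key for the row.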
OK, a couple of notes:
A. I think that this approach is correct, but I am really not certain, so it will definitely need some vetting, etc.
B. Although it is long, it's still pretty sketchy. Each individual step is non-trivial in itself.
C. Although there are plenty of individual optimization opportunities, the overall algorithm is still pretty complicated. It is a lot better than brute-force, but even so, my back-of-the-napkin estimate is still around (2.5 to 10)*10^11 operations for N=7.
So it's probably tractable, but still a long way off from doing 74 cases in 3 seconds. I haven't read all of the detail for Peter de Revaz's answer, but his idea of rotating the "diamond" might be workable for my algorithm. Although it would increase the complexity of the inner loop, it may drop the size of the dictionaries (and thus, the number of grouping-vectors to compare against) by as much as a 100x, though it's really hard to tell without actually trying it.
Note also that there isn't any dynamic programming here. I couldn't come up with an easy way to leverage it, so that might still be an avenue for improvement.
OK, I enumerated all possible valid grouping-vectors to get a better estimate of (C) above, which lowered it to O(3.5*10^9) for N=7. That's much better, but still about an order of magnitude over what you probably need to finish 74 tests in 3 seconds. That does depend on the tests though, if most of them are smaller than N=7, it might be able to make it.

Here is a rough sketch of an approach for this problem.
First note that the lattice points need |x|+|y| < N, which for N=7 results in a diamond shape going from coordinates 0,6 to 6,0, i.e. with 7 points on each side.
If you imagine rotating this diamond by 45 degrees, you will end up with a 7*7 square lattice which may be easier to think about. (Although note that there are also intermediate 6 high columns.)
For example, for N=3 the original lattice points are:
..A..
.BCD.
EFGHI
.JKL.
..M..
Which rotate to
A D I
 C H
B G L
 F K
E J M
On the (possibly rotated) lattice I would attempt to solve by dynamic programming the problem of counting the number of ways of placing armies in the first x columns such that the last column is a certain string (plus a boolean flag to say whether some points have been placed yet).
The string contains a digit for each lattice point.
0 represents an empty location
1 represents an isolated point
2 represents the first of a new connected group
3 represents an intermediate in a connected group
4 represents the last in a connected group
During the algorithm the strings can represent shapes containing multiple connected groups, but we reject any transformations that leave an orphaned connected group.
When you have placed all columns you need to only count strings which have at most one connected group.
For example, the string for the first 5 columns of the shape below is:
....+ = 2
..+++ = 3
..+.. = 0
..+.+ = 1
..+.. = 0
..+++ = 3
..+++ = 4
The middle + is currently unconnected, but may become connected by a later column so still needs to be tracked. (In this diagram I am also assuming an up/down/left/right 4-connectivity. The rotated lattice should really use a diagonal connectivity, but I find that a bit harder to visualise and I am not entirely sure it is still a valid approach with this connectivity.)
I appreciate that this answer is not complete (and could do with lots more pictures/explanation), but perhaps it will prompt someone else to provide a more complete solution.
