Parallel generation of partitions

Parallel generation of partitions - algorithm

I am using an algorithm (implemented in C) that generates partitions of a set. (The code is here: http://www.martinbroadhurst.com/combinatorial-algorithms.html#partitions).
I was wondering if there is a way to modify this algorithm to run in parallel instead of linearly?
I've got multiple cores on my CPU and would like split up the generation of partitions into multiple running threads.

Initialize a shared collection containing every partition of the first k elements. Each thread, until the collection is empty, repeatedly removes a partition from the collection and generates all possibilities for the remaining n - k elements using the algorithm you linked to (get another k-element prefix when incrementing the current n-element partition would change the one of the first k elements).

As you can see your referred algorithms creates counter in base n and each time put items with same number in one group, and in such a way partitions input.
Each counter counts from 0 to (0,1,2,...,n-1) which means A=n-1+(n-2)*n+...+1*nn-1+0 numbers. So you can run your algorithm on k different thread, in first thread you should count from 0 to A/k, in second you should count from (A/k)+1 to 2*A/k and so on. means just you should add a long variable and check it with upper bound (in your for loop conditions) Also calculating A value and related number in base n format for r*A/k for 0 <= r <= k.

First, consider the following variation of the serial algorithm. Take the element a, and assign it to the subset #0 (this is always valid, because the order of subsets inside a partition does not matter). The next element b might belong either to the same subset as a or to a different one, i.e. to subset #1. Then, the element c belongs to either #0 (together with a) or #1 (together with b if it's separate from a), or to its own subset (which will be #1 if #0={a,b}, or #2 if #0={a} and #1={b}). And so on. So you add new elements one by one to partially built partitions, producing a few possible outputs for each input - until you put all the elements. The key to parallelization is that each incomplete partition can be appended with new elements independently, i.e. in parallel with, all other variants.
The algorithm can be implemented in different ways. I would use a recursive approach, in which a function is given a partially filled array and its current length, copies the array as many times as there are possible values for the next element (which is one more than the current last value of the array), sets the next element to every possible value and calls itself recursively for each new array, with increased length. This approach seems particularly good for work-stealing parallel engines, such as cilk or tbb. An implementation similar to suggested by #swen is also possible: you use a collection of all incomplete partitions and a pool of threads, and each thread takes one partition from the collection, produces all possible extensions and put those back to the collection; partitions with all elements added should obviously go into a different collection.

Here is the c++ implementation I obtained using swen's suggestion. The number of threads depends on the value of r. For r=6 the number of partitions is the sixth bell number, which is equal to 203. For r=0 we just get a normal non-parallel program.
#include "omp.h"
#include <bits/stdc++.h>
using namespace std;
typedef long long lli;
const int MAX=10010;
const int MX=100;
int N,r=6;
int F[MAX]; // partitions first r
int Fa[MAX][MX]; // complete partitions
int P[MAX]; // first appearances first r
int Pa[MAX][MX]; // first appearances complete
int next(){// iterates to next partition of first r
for(int i=r-1;i>=0;i--){
P[F[i]]=i;
}
for(int i=r-1;i>=0;i--){
if( P[F[i]]!=i ){
F[i]++;
for(int j=i+1;j<r;j++){
F[j]=0;
}
return(1);
}
}
return(0);
}
int sig(int ID){// iterates to next partition in thread
for(int i=N-1;i>=0;i--){
Pa[ID][Fa[ID][i]]=i;
}
for(int i=N-1;i>=r;i--){
if( Pa[ID][Fa[ID][i]]!=i){
Fa[ID][i]++;
for(int j=i+1;j<N;j++){
Fa[ID][j]=0;
}
return(1);
}
}
return(0);
}
int main(){
int N;
scanf("%d",&N);
int t=1,partitions=0;
while(t || next() ){// save the current partition so we can use it for a thread later
t=0;
for(int i=0;i<r;i++){
Fa[partitions][i]=F[i];
}
partitions++;
}
omp_set_num_threads(partitions);
#pragma omp parallel
{
int ID = omp_get_thread_num();
int t=1;
while(t || sig(ID) ){// iterate through each partition in the thread
// the current partition in the thread is found in Fa[ID]
}
}
}

Related

Fill device array consecutively in CUDA

(This might be more of a theoretical parallel optimization problem then a CUDA specific problem per se. I'm very new to Parallel Programming in general so this may just be personal ignorance.)
I have a workload that consists of a 64-bit binary numbers upon which I run analysis. If the analysis completes successfully then that binary number is a "valid solution". If the analysis breaks midway then the number is "invalid". The end goal is to get a list of all the valid solutions.
Now there are many trillions of 64 bit binary numbers I am analyzing, but only ~5% or less will be valid solutions, and they usually come in bunches (i.e. every consecutive 1000 numbers are valid and then every random billion or so are invalid). I can't find a pattern to the space between bunches so I can't ignore the large chunks of invalid solutions.
Currently, every thread in a kernel call analyzes just one number. If the number is valid it denotes it as such in it's respective place on a device array. Ditto if it's invalid. So basically I generate a data point for very value analyzed regardless if it's valid or not. Then once the array is full I copy it to host only if a valid solution was found (denoted by a flag on the device). With this, overall throughput is greatest when the array is the same size as the # of threads in the grid.
But Copying Memory to & from the GPU is expensive time wise. That said what I would like to do is copy data over only when necessary; I want to fill up a device array with only valid solutions and then once the array is full then copy it over from the host. But how do you consecutively fill an array up in a parallel environment? Or am I approaching this problem the wrong way?
EDIT 1
This is the Kernel I initially developed. As you see I am generating 1 byte of data for each value analyzed. Now I really only need each 64 bit number which is valid; if I need be I can make a new kernel. As suggested by some of the commentators I am currently looking into stream compaction.
__global__ void kValid(unsigned long long*kInfo, unsigned char*values, char *solutionFound) {
//a 64 bit binary value to be evaluated is called a kValue
unsigned long long int kStart, kEnd, kRoot, kSize, curK;
//kRoot is the kValue at the start of device array, this is used is the device array is larger than the total threads in the grid
//kStart is the kValue to start this kernel call on
//kEnd is the last kValue to validate
//kSize is how many bits long is kValue (we don't necessarily use all 64 bits but this value stays constant over the entire chunk of values defined on the host
//curK is the current kValue represented as a 64 bit unsigned integer
int rowCount, kBitLocation, kMirrorBitLocation, row, col, nodes, edges;
kStart = kInfo[0];
kEnd = kInfo[1];
kRoot = kInfo[2];
nodes = kInfo[3];
edges = kInfo[4];
kSize = kInfo[5];
curK = blockIdx.x*blockDim.x + threadIdx.x + kStart;
if (curK > kEnd) {//check to make sure you don't overshoot the end value
return;
}
kBitLocation = 1;//assuming the first bit in the kvalue has a position 1;
for (row = 0; row < nodes; row++) {
rowCount = 0;
kMirrorBitLocation = row;//the bit position for the mirrored kvals is always starts at the row value (assuming the first row has a position of 0)
for (col = 0; col < nodes; col++) {
if (col > row) {
if (curK & (1 << (unsigned long long int)(kSize - kBitLocation))) {//add one to kIterator to convert to counting space
rowCount++;
}
kBitLocation++;
}
if (col < row) {
if (col > 0) {
kMirrorBitLocation += (nodes - 2) - (col - 1);
}
if (curK & (1 << (unsigned long long int)(kSize - kMirrorBitLocation))) {//if bit is set
rowCount++;
}
}
}
if (rowCount != edges) {
//set the ith bit to zero
values[curK - kRoot] = 0;
return;
}
}
//set the ith bit to one
values[curK - kRoot] = 1;
*solutionFound = 1; //not a race condition b/c it will only ever be set to 1 by any thread.
}

(This answer assumes output order is inconsequential and so are the positions of the valid values.)
Conceptually, your analysis produces a set of valid values. The implementation you described uses a dense representation of this set: One bit for every potential value. Yet you've indicated that the data is quite sparse (either 5e-2 or 1000/10^9 = 1e-6); moreover, copying data across PCI express is quite a pain.
Well, then, why not consider a sparse representation? The simplest one would be merely an unordered sequence of the valid values. Of course, writing that requires some synchronization across threads - perhaps even across blocks. Roughly, you can have warps collect their valid values in shared memory; then synchronize at the block level to collect the block's valid values (for a given chunk of the input it has analyzed); and finally use atomics to collect the data from all the blocks.
Oh, also - have each thread analyze multiple values, so you don't have to do that much synchronization.

So, you would want to have each thread analyze multiple numbers (thousands or millions) before you do a return from the computation. So if you analyze a million numbers in your thread, you will only need %5 of that amount of space to possible hold the results of that computation.

Grouping numbers in a list

I came across the following question,
You are given an array A of n elements. These elements are now added to a new list L which is initially empty , in a certain order based on the given q queries.
In each query you are given an integer i that corresponds to A[i] in the array A. This means that you have to add the element A[i] to the list L.
After each element is added to the list L, make groups among the elements in the list L. Two elements will be in same group if their indexes in the array A are consecutive.
For each group we define the group’s value as axb where a is the largest value in that group and b is the size of that group.
Print the maximum group value among all the groups that are formed after each element is added to the list L.
My approach was to use a map<int,vector<int>> where key is the group number and value is a vector containing group size, max. of group. I also had an array g and g[i] indicated group number of a[i], -1 if it is not in any group. The code below is a part of my implementation, but I'm sure there are better ways to solve this question as this solution of mine gave TLE and WA in some cases,and I can't seem to figure out the correct approach. Pls suggest optimal way to solve this.
int g[a.size()+2]; //+2 because queries start with index 1, and g[i] corresponds to a[i-1]
for(int i=0;i<a.size()+2;i++)
g[i]=-1;
int gno=1;
map<int,vector<int> > m;
vector<int> ans;
int mx=0;
for(unsigned int i=0;i<queries.size();i++){
int q = queries[i];
if(g[q-1]==-1 && g[q+1]==-1){
//create new group with current eleent as first element
g[q] = gno; //gno is the group number.
vector<int> v;
v.push_back(1);
v.push_back(a[q-1]);
m[gno]=v;
mx = max(mx,m[gno][0]*m[gno][1]);
gno++;
}
else if(g[q-1]!=-1 && g[q+1]==-1){
//join current element to left group
g[q] = g[q-1];
m[g[q]][0]++;
m[g[q]][1] = max(m[g[q]][1],a[q-1]);
mx = max(mx,m[g[q]][0]*m[g[q]][1]);
}
else if(g[q-1]==-1 && g[q+1]!=-1){
//join current element to right group
g[q] = g[q+1];
m[g[q]][0]++;
m[g[q]][1] = max(m[g[q]][1],a[q-1]);
mx = max(mx,m[g[q]][0]*m[g[q]][1]);
}
else{
//join both groups to left and right
g[q]=g[q-1];
int g1 = g[q];
int i;
m[g[q]][0] += 1 + m[g[q+1]][0];
m[g[q]][1] = max(m[g[q]][1],max(a[q-1],m[g[q+1]][1]));
for(i=q+1;g[i]==g[i+1];i++){
g[i]=g1;
}
g[i]=g1;
mx = max(mx,m[g[q]][0]*m[g[q]][1]);
}
ans.push_back(mx);
}
.

I would not actually build list L. It may be too costly in time to find what to do with a new value: is it a new group on itself, does it extend an existing group, do two groups need to merge into one? If the first values are all far apart, you'll have many groups, and you need to iterate them with each new incoming value: this is not efficient.
I would just collect all the values first and only then see how they fit in groups.
There are two ways to collect the values:
Store them in a list, and when all values have been collected, sort the list in ascending order
Flag the entry in an array of booleans of size n. This way you do not have to sort it, but afterwards you do need to iterate the whole array to find the values in ascending order.
Method 1 will be the best when q is a lot less than n. Method 2 will be better for greater q.
With both methods you'll be able to iterate over the found values in ascending order, and while doing so you can identify the groups, their value, and also keep track of the largest group-value. Only one sweep is needed to find the answer.

Let's start with two simplifying assumptions:
no duplicates. Once a given index i has been "queried", it will never be queried again.
no negative numbers. All elements are positive or zero, so the largest value in a group is always positive or zero, so expanding a group (or merging two groups) will never cause the overall "maximum group value" to decrease.
(Further below I'll show how to not require those assumptions, but for now this will simplify the picture.)
So, whenever we "query" an index i, there are four cases:
i-1 is currently the right-endpoint of a group (by which I mean its greatest index) and i+1 is currently the left-endpoint of another group.
In this case, we need to merge the two groups into a single group, with i bridging the gap between them.
i-1 is currently the right-endpoint of a group, but i+1 is not currently in any group.
In this case we need to extend the group to cover i.
i-1 is not currently in any group, but i+1 is currently the left-endpoint of a group.
In this case, as in the previous case, we need to extend the group to cover i.
Neither i-1 nor i+1 is in a group.
In this case, we have a new group with just one element.
In all cases, the key thing to note is that we're only interested in the endpoints of groups. So we don't need a general mapping from indices to their groups . . . which is good, because when we merge two groups, it would be expensive to then go and update every single index from one group to point to the other.
So we just need three mappings:
std::unordered_map<int, int> map_from_left_endpoint_to_right_endpoint;
std::unordered_map<int, int> map_from_right_endpoint_to_left_endpoint;
std::unordered_map<int, int> map_from_left_endpoint_to_largest_value;
To distinguish the four cases, we use e.g. map_from_right_endpoint_to_left_endpoint.find(i - 1) (which returns an iterator pointing to the left-endpoint of the group that i-1 is the right-endpoint of, if applicable; otherwise it returns map_from_right_endpoint_to_left_endpoint.end()). We then delete entries as they become no-longer-applicable (due to groups being extended or merged in a given direction), in addition to (obviously) inserting new entries, and updating the values of existing entries.
In addition to those values, we also need an
int maximum_group_value = 0;
and whenever we extend a group or merge two groups, we check whether the value of the resulting group (meaning its largest_value * (right_endpoint - left_endpoint + 1) is greater than maximum_group_value. If so, we update maximum_group_value and return it; if not, we return maximum_group_value as-is.
Now, what if duplicates are allowed, such that a given index i might be "queried" after it already belongs to a group?
The simplest approach is to simply keep track of which i-s have already been queried; but a more elegant approach, if desired, might be to change map_from_left_endpoint_to_right_endpoint from a std::unordered_map to a std::map, and then use something like this:
bool is_already_in_a_group(
std::map<int, int> const & map_from_left_endpoint_to_right_endpoint,
int const i) {
// get iterator to first element *after* index (or to 'end()' if no such):
auto iter = map_from_left_endpoint_to_right_endpoint.upper_bound(index);
// if that pointer points to 'begin()', then there are no elements
// at or before index:
if (iter == map_from_left_endpoint_to_right_endpoint.begin()) {
return false;
}
// otherwise, move iterator to point to the last element whose key is
// less than or equal to index:
--iter;
// . . . and check whether the value of that element is greater than
// or equal to index (meaning that [key, value] spans index):
return iter->second >= index;
}
to check if the greatest key in map_from_left_endpoint_to_right_endpoint that is less than or equal to i is mapped to a value that is greater than or equal to i.
This adds a fifth case to our case analysis above — "if i is already inside a group, just do nothing and return maximum_group_value" — but other than that, has no effect.
Note that this same approach also lets us eliminate map_from_right_endpoint_to_left_endpoint, if we want: the above function could easily be tweaked to int get_left_endpoint_for_right_endpoint by changing its return statement to return iter->second == index ? iter->first : -1;.
At this point it becomes sensible to define a Group class with three fields (left_endpoint, right_endpoint, and largest_value), and just keep a single map_from_left_endpoint_to_group.
Lastly — what if negative values are allowed, such that the "maximum group value" can actually decrease as the result of a query? (For example, if the array elements are [-1, -10] and the queries are i=0, i=1, then the results are maximum_group_value=-1, maximum_group_value=-2.) In such a case, we need to keep track of the values of all current groups, because any one of them might suddenly become the maximum.
For that, instead of storing a single int maximum_group_value, we can maintain a heap of groups, ordered by value, that we push into every time we create/extend/merge groups. (We can just use a std::vector<Group> for this, plus std::push_heap with an appropriate comparator, or with an appropriate definition for operator<(Group const &, Group const &).) After each query, we check if the top group on the heap (the first element in the vector) is still a group that actually exists; if so, we return its value, otherwise we pop it (using std::pop_heap) and repeat.
As an optimization, we can also store int maximum_group_value, and eliminate the heap once we've encountered a nonnegative array-element (since as soon as a given group contains a nonnegative array-element, its value can never decrease again, and obviously the maximum group value will be the value of one of those groups).

Sorting and Counting Elements in OpenCL

I want to create an OpenCL kernel that sorts and counts millions of ulong.
There is a particular algorithm that fits my needs or should I go for an hash table?
To be clear, given the following input:
[42, 13, 9, 42]
I would like to get an output like this:
[(9,1), (13,1), (42,2)]
My first idea was to modify the Counting Sort - which already counts in order to sort - but because of the wide range of ulongs it requires too much memory. Bitonic or Radix sort plus something to count elements could be a way but I miss a fast way to count the elements. Any suggestions on this?
Extra notes:
I'm developing using an NVIDIA Tesla K40C GPU and a Terasic DE5-Net FPGA. So far the main goal is to make it work on the GPU but I'm also interested in solutions that might be a nice fit for FPGAs.
I know that some values inside the range of ulong aren't used so we can use them to mark invalid elements or duplicates.
I want to consume the output from the GPU using multiple threads in the CPU so a would like to avoid any solution that require some post-processing (in the host side I mean) that has data dependencies spread around the output.

This solution requires two passes of the bitonic sort to both count the duplicates as well as remove them (well move them to the end of the array). Bitonic sort is O(log(n)^2), so this then will run with time complexity 2(log(n)^2), which shouldn't be a problem unless you are running this in a loop.
Create a simple struct for each of the elements, to include the number of duplicates, and if the element has been added as a duplicate, something like:
// Note: If you are worried about space, or know that there
// will only be a few duplicates for each element, then
// make the count element smaller
typedef struct {
cl_ulong value;
cl_ulong count : 63;
cl_ulong seen : 1;
} Element;
Algorithm:
You can start by creating a comparison function which will move duplicates to the end, and count the duplicates if they are you to be added to the total count for the element. This is the logic behind the comparison function:
If one element is a duplicate and another is not, return that the non-duplicate element is smaller (regardless of the values), which will move all duplicates to the end.
If the elements are duplicates and the right element has not been marked a duplicate (seen=0), then add the right element's count to the left element's count and set the right element as a duplicate (seen=1). This has the effect of moving the total count of an element with a specific value to the leftmost element in the array with that value, and all duplicates with that value to the end.
Otherwise return that the element with the smaller value, is smaller.
The comparison function would look like:
bool compare(const Element* E1, const Element* E2) {
if (!E1->seen && E2->seen) return true; // E1 smaller
if (!E2->seen && E1->seen) return false; // E2 smaller
// If the elements are duplicates and the right element has
// not yet been "seen" by an element with the same value
if (E1->value == E2->value && !E2->seen) {
E1->count += E2->count;
E2->seen = 1;
return true;
}
// They aren't duplicates, and either
// neither has been seen, or both have
return E1->value < E2->value;
}
Bitonic sort has a specific structure, which can be nicely illustrated with a diagram. In the diagram, each element is referred to by a 3-tuple (a,b,c) where a = value, b = count, and c = seen.
Each diagram shows one run of bitonic sort on the array (vertical lines denote a comparison between elements, and horizontal lines move right to the next stage of the bitonic sort). Using the diagram and the above comparison function and logic, you should be able to convince yourself that this does what is required.
Run 1:
Run 2:
At the end of run 2, all elements are arranged by value. Duplicates with seen = 1 are at the end, and duplicates with seen = 0 are in their correct place and count is the number of other elements with the same value.
Implementation:
The diagrams are color coded to illustrate the sub-processes of bitonic sort. I'll call the blue blocks a phase (there are three phases in each run in the diagrams). In general, there will be ceil(log(N)) phases for each run. Each phase consists of a number of green block (I'll call these out-in blocks, because the shape of the comparisons is out to in), and red blocks (I'll call these constant blocks, because the distance between elements to compare remains constant).
From the diagram, the out-in block size (elements in each block) starts at 2 and doubles in each pass. The constant block size for each pass starts at half the out-in block size (in the second (blue block) phase, there are 2 elements in each of the four red blocks, because the green blocks have a size of 4) and halves for each successive vertical lines of red block within the phase. Also, the number of successive vertical lines of the constant (red) blocks in a phase is always the same as the phase number with 0 indexing (0 vertical lines of red blocks for phase 0, 1 vertical line of red bocks for phase 1, and 2 vertical lines of red blocks for phase 2 -- each vertical line is an iteration of calling that kernel).
You can then make kernels for the out-in passes, and the constant passes, then invoke the kernels from the host side (because you need to constantly synchronise, which is a disadvantage, but you should still see large performance improvements over sequential implementations).
From the host side, the overall bitonic sort might look like:
cl_uint num_elements = 4; // Set number of elements
cl_uint phases = (cl_uint)ceil((float)log2(num_elements));
cl_uint out_in_block_size = 2;
cl_uint constant_block_size;
// Set the elements_buffer, which should have been created with
// with clCreateBuffer, as the first kernel argument, and the
// number of elements as the second kernel argument
clSetKernelArg(out_in_kernel, 0, sizeof(cl_mem), (void*)(&elements_buffer));
clSetKernelArg(out_in_kernel, 1, sizeof(cl_uint), (void*)(&num_elements));
clSetKernelArg(constant_kernel, 0, sizeof(cl_mem), (void*)(&elements_buffer));
clSetKernelArg(constant_kernel, 1, sizeof(cl_uint), (void*)(&num_elements));
// For each pass
for (unsigned int phase = 0; phase < phases; ++phase) {
// -------------------- Green Part ------------------------ //
// Set the out_in_block size for the kernel
clSetKernelArg(out_in_kernel, 2, sizeof(cl_int), (void*)(&out_in_block_size));
// Call the kernel - command_queue is the clCommandQueue
// which should have been created during cl setup
clEnqueNDRangeKernel(command_queue , // clCommandQueue
out_in_kernel , // The kernel
1 , // Work dim = 1 since 1D array
NULL , // No global offset
&global_work_size,
&local_work_size ,
0 ,
NULL ,
NULL);
barrier(CLK_GLOBAL_MEM_FENCE); // Synchronise
// ---------------------- End Green Part -------------------- //
// Set the block size for constant blocks based on the out_in_block_size
constant_block_size = out_in_block_size / 2;
// -------------------- Red Part ------------------------ //
for (unsigned int i 0; i < phase; ++i) {
// Set the constant_block_size as a kernel argument
clSetKernelArg(constant_kernel, 2, sizeof(cl_int), (void*)(&constant_block_size));
// Call the constant kernel
clEnqueNDRangeKernel(command_queue , // clCommandQueue
constant_kernel , // The kernel
1 , // Work dim = 1 since 1D array
NULL , // No global offset
&global_work_size,
&local_work_size ,
0 ,
NULL ,
NULL);
barrier(CLK_GLOBAL_MEM_FENCE); // Synchronise
// Update constant_block_size for next iteration
constant_block_size /= 2;
}
// ------------------- End Red Part ---------------------- //
}
And then the kernels would be something like (you also need to put the struct typedef in the kernel file so that the OpenCL compiler know what 'Element' is):
__global void out_in_kernel(__global Element* elements, unsigned int num_elements, unsigned int block_size) {
const unsigned int idx_upper = // index of upper element in diagram.
const unsigned int idx_lower = // index of lower element in diagram
// Check that both indices are in range (this depends on thread mapping)
if (idx_upper is in range && index_lower is in range) {
// Do the comparison
if (!compare(elements + idx_upper, elements + idx_lower) {
// Swap the elements
}
}
}
The constant_kernel will look the same, but the thread mapping (how you determine idx_upper and idx_lower) will be different. There are many ways you can map the threads to the elements generally to mimic the diagrams (note that the number of threads required is half the total number of elements, since each thread can do one comparison).
Another consideration is how to make the thread mapping general (so that if you have a number of elements which is not a power of two the algorithm doesn't break).

How about boost.compute or VexCL? Both provide sorting algorithms.

Mergesort works quite well on GPUs and you could modify it to sort key+count instead of keys only. During merging you would then also check if do keys are identical and if yes, fuse them into a single key during merge. (If you merge [9/c:1, 42/c:1] and [13/c:1,42/c:1] you would get [9/c:1,13/c:1,42/c:2] )
You might have to use parallel prefix sum to remove the gaps caused by fusing keys.
Or: Use a regular GPU sort first, mark all keys where the key to its right is different (this is only true at the last key of each unique key), use parallel prefix sum to get consecutive indexes for all unique keys and note their position in the sorted array. Then you only need to subtract the index of the previous unique key to get the count.

Is there an efficient data structure for row and column swapping?

I have a matrix of numbers and I'd like to be able to:
Swap rows
Swap columns
If I were to use an array of pointers to rows, then I can easily switch between rows in O(1) but swapping a column is O(N) where N is the amount of rows.
I have a distinct feeling there isn't a win-win data structure that gives O(1) for both operations, though I'm not sure how to prove it. Or am I wrong?

Without having thought this entirely through:
I think your idea with the pointers to rows is the right start. Then, to be able to "swap" the column I'd just have another array with the size of number of columns and store in each field the index of the current physical position of the column.
m =
[0] -> 1 2 3
[1] -> 4 5 6
[2] -> 7 8 9
c[] {0,1,2}
Now to exchange column 1 and 2, you would just change c to {0,2,1}
When you then want to read row 1 you'd do
for (i=0; i < colcount; i++) {
print m[1][c[i]];
}

Just a random though here (no experience of how well this really works, and it's a late night without coffee):
What I'm thinking is for the internals of the matrix to be a hashtable as opposed to an array.
Every cell within the array has three pieces of information:
The row in which the cell resides
The column in which the cell resides
The value of the cell
In my mind, this is readily represented by the tuple ((i, j), v), where (i, j) denotes the position of the cell (i-th row, j-th column), and v
The would be a somewhat normal representation of a matrix. But let's astract the ideas here. Rather than i denoting the row as a position (i.e. 0 before 1 before 2 before 3 etc.), let's just consider i to be some sort of canonical identifier for it's corresponding row. Let's do the same for j. (While in the most general case, i and j could then be unrestricted, let's assume a simple case where they will remain within the ranges [0..M] and [0..N] for an M x N matrix, but don't denote the actual coordinates of a cell).
Now, we need a way to keep track of the identifier for a row, and the current index associated with the row. This clearly requires a key/value data structure, but since the number of indices is fixed (matrices don't usually grow/shrink), and only deals with integral indices, we can implement this as a fixed, one-dimensional array. For a matrix of M rows, we can have (in C):
int RowMap[M];
For the m-th row, RowMap[m] gives the identifier of the row in the current matrix.
We'll use the same thing for columns:
int ColumnMap[N];
where ColumnMap[n] is the identifier of the n-th column.
Now to get back to the hashtable I mentioned at the beginning:
Since we have complete information (the size of the matrix), we should be able to generate a perfect hashing function (without collision). Here's one possibility (for modestly-sized arrays):
int Hash(int row, int column)
{
return row * N + column;
}
If this is the hash function for the hashtable, we should get zero collisions for most sizes of arrays. This allows us to read/write data from the hashtable in O(1) time.
The cool part is interfacing the index of each row/column with the identifiers in the hashtable:
// row and column are given in the usual way, in the range [0..M] and [0..N]
// These parameters are really just used as handles to the internal row and
// column indices
int MatrixLookup(int row, int column)
{
// Get the canonical identifiers of the row and column, and hash them.
int canonicalRow = RowMap[row];
int canonicalColumn = ColumnMap[column];
int hashCode = Hash(canonicalRow, canonicalColumn);
return HashTableLookup(hashCode);
}
Now, since the interface to the matrix only uses these handles, and not the internal identifiers, a swap operation of either rows or columns corresponds to a simple change in the RowMap or ColumnMap array:
// This function simply swaps the values at
// RowMap[row1] and RowMap[row2]
void MatrixSwapRow(int row1, int row2)
{
int canonicalRow1 = RowMap[row1];
int canonicalRow2 = RowMap[row2];
RowMap[row1] = canonicalRow2
RowMap[row2] = canonicalRow1;
}
// This function simply swaps the values at
// ColumnMap[row1] and ColumnMap[row2]
void MatrixSwapColumn(int column1, int column2)
{
int canonicalColumn1 = ColumnMap[column1];
int canonicalColumn2 = ColumnMap[column2];
ColumnMap[row1] = canonicalColumn2
ColumnMap[row2] = canonicalColumn1;
}
So that should be it - a matrix with O(1) access and mutation, as well as O(1) row swapping and O(1) column swapping. Of course, even an O(1) hash access will be slower than the O(1) of array-based access, and more memory will be used, but at least there is equality between rows/columns.
I tried to be as agnostic as possible when it comes to exactly how you implement your matrix, so I wrote some C. If you'd prefer another language, I can change it (it would be best if you understood), but I think it's pretty self descriptive, though I can't ensure it's correctedness as far as C goes, since I'm actually a C++ guys trying to act like a C guy right now (and did I mention I don't have coffee?). Personally, writing in a full OO language would do it the entrie design more justice, and also give the code some beauty, but like I said, this was a quickly whipped up implementation.

An efficient way to find matching items in N lists?

Given a number of lists of items, find the lists with matching items.
The brute force pseudo-code for this problem looks like:
foreach list L
foreach item I in list L
foreach list L2 such that L2 != L
for each item I2 in L2
if I == I2
return new 3-tuple(L, L2, I) //not important for the algorithm
I can think of a number of different ways of going about this - creating a list of lists and removing each candidate list after searching the others for example - but I'm wondering if there is a better algorithm for this?
I'm using Java, if that makes a difference to your implementation.
Thanks

Create a Map<Item,List<List>>.
Iterate through every item in every list.
each time you touch an item, add the current list to that item's entry in the Map.
You now have a Map entry for each item that tells you what lists that item appears in.
This algorithm is about O(N) where N is the number of lists (the exact complexity will be affected by how good your Map implementation is). I believe your algorithm was at least O(N^2).
Caveat: I am comparing number of comparisons, not memory use. If your lists are super huge and full of mostly non duplicated items, the map that my method creates might become too big.

As per your comment you want a MultiMap implementation. A multimap is like a Map but it can map each key to multiple values. Store the value and a reference to all the maps that contain that value.
Map<Object, List>
of course you should use a type safe instead of Object and a type safe List as the value. What you are trying to do is called an Inverted Index.

I'll start with the assumption that the datasets can fit in memory. If not, then you will need something fancier.
I refer below to a "set", where I am thinking of something like a C++ std::set. I don't know the Java equivalent, but any storage scheme that permits rapid lookup (tree, hash table, whatever).
Comparing three lists: L0, L1 and L2.
Read L0, placing each element in a set: S0.
Read L1, placing items that match an element of S0 into a new set: S1, and discarding others.
Discard S0.
Read L2, keeping items that match an element of S1 and discarding others.
Update
Just realised that the question was for "n" lists, not three. However the extension should be obvious. (I hope)
Update 2
Some untested C++ code to illustrate the algorithm
#include <string>
#include <vector>
#include <set>
#include <cassert>
typedef std::vector<std::string> strlist_t;
strlist_t GetMatches(std::vector<strlist_t> vLists)
{
assert(vLists.size() > 1);
std::set<std::string> s0, s1;
std::set<std::string> *pOld = &s1;
std::set<std::string> *pNew = &s0;
// unconditionally load first list as "new"
s0.insert(vLists[0].begin(), vLists[0].end());
for (size_t i=1; i<vLists.size(); ++i)
{
//swap recently read "new" to "old" now for comparison with new list
std::swap(pOld, pNew);
pNew->clear();
// only keep new elements if they are matched in old list
for (size_t j=0; j<vLists[i].size(); ++j)
{
if (pOld->end() != pOld->find(vLists[i][j]))
{
// found match
pNew->insert(vLists[i][j]);
}
}
}
return strlist_t(pNew->begin(), pNew->end());
}

You can use a trie, modified to record what lists each node belongs to.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio