Coding for the Cache and Hit rate for the algorithm - algorithm

i know hit rate is the percentage of found data in cache. but i have no idea how to find the hit for an algorithm. i was thinking that for code 1 i will have 11 blocks each with 4 elements and for code 2 i'll have 4 blocks each with 11 elements and each time i see 4 elements missed. not sure if that make sense at all. any advise is welcomed
Suppose a 2-dimensional array A with 11 rows by 4 columns, stored in memory like this [0][0], [0][1], [0][2], [0][3], [1][0], [1][1], …[10][2], [10][3]
Also suppose a fully associative single level cache of 10 memory blocks, with each memory block holding 4 bytes, and a FIFO replacement policy.
Each row fits exactly into one cache block and rather unluckily, the whole array cannot fit into cache. the cache is one row too small...
Now given the 2 following codes,
1- how do i calculate the hit rate
2- given that cache access time is 5ns and memory access time is 70ns, and assuming overlapping access to memory and cache, how do i calculate the EAT for each code?
Code 1:
for (int row = 0; row < 11; row ++)
{
for (int column = 0; column < 4; column ++)
{
myByte = A [ row, column ];
}
}
Code 2:
for (int column = 0; column < 4; column ++)
{
for (int row = 0; row < 11; row ++)
{
myByte = A [ row, column ];
}
}
Any help is appreciated. Thanks

Well we dont calculate hit rate for an algorithm. We write algorithm in such a way which utilizies the concept of cache i.e to get maximum hit rate. Your question actually belongs to Computer Organization and Architecture.
Anyways, your 2D - array is stored in memory in a row major order. So, for code 1,
with first inner for loop a block would be fetched from memory containing four element since nothing was found in the cache (as it is a miss) (i.e memory access time 70ns).Now subsequent three element i.e [0][1],[0][2],[0][3] would be fetched from cahce(i.e 3*5 = 15ns) i.e total time for accessing first four element = 70+3*5 = 85ns.
The above process would repeat for ten rows of your array w.r.t you 10 cache block. but for the last block using FIFO concept swaping of block would take place but the time would remain same. so total time = 85*11 = 935ns
for code 2,
for every iteration of inner for loop a block would be fetched. So, every time you are accessing memory and in that case you are not utilizing the concept of cache i.e its a bad piece of code. The code 2 would work best if your array is stored in column major order in memory.

Related

Fill device array consecutively in CUDA

(This might be more of a theoretical parallel optimization problem then a CUDA specific problem per se. I'm very new to Parallel Programming in general so this may just be personal ignorance.)
I have a workload that consists of a 64-bit binary numbers upon which I run analysis. If the analysis completes successfully then that binary number is a "valid solution". If the analysis breaks midway then the number is "invalid". The end goal is to get a list of all the valid solutions.
Now there are many trillions of 64 bit binary numbers I am analyzing, but only ~5% or less will be valid solutions, and they usually come in bunches (i.e. every consecutive 1000 numbers are valid and then every random billion or so are invalid). I can't find a pattern to the space between bunches so I can't ignore the large chunks of invalid solutions.
Currently, every thread in a kernel call analyzes just one number. If the number is valid it denotes it as such in it's respective place on a device array. Ditto if it's invalid. So basically I generate a data point for very value analyzed regardless if it's valid or not. Then once the array is full I copy it to host only if a valid solution was found (denoted by a flag on the device). With this, overall throughput is greatest when the array is the same size as the # of threads in the grid.
But Copying Memory to & from the GPU is expensive time wise. That said what I would like to do is copy data over only when necessary; I want to fill up a device array with only valid solutions and then once the array is full then copy it over from the host. But how do you consecutively fill an array up in a parallel environment? Or am I approaching this problem the wrong way?
EDIT 1
This is the Kernel I initially developed. As you see I am generating 1 byte of data for each value analyzed. Now I really only need each 64 bit number which is valid; if I need be I can make a new kernel. As suggested by some of the commentators I am currently looking into stream compaction.
__global__ void kValid(unsigned long long*kInfo, unsigned char*values, char *solutionFound) {
//a 64 bit binary value to be evaluated is called a kValue
unsigned long long int kStart, kEnd, kRoot, kSize, curK;
//kRoot is the kValue at the start of device array, this is used is the device array is larger than the total threads in the grid
//kStart is the kValue to start this kernel call on
//kEnd is the last kValue to validate
//kSize is how many bits long is kValue (we don't necessarily use all 64 bits but this value stays constant over the entire chunk of values defined on the host
//curK is the current kValue represented as a 64 bit unsigned integer
int rowCount, kBitLocation, kMirrorBitLocation, row, col, nodes, edges;
kStart = kInfo[0];
kEnd = kInfo[1];
kRoot = kInfo[2];
nodes = kInfo[3];
edges = kInfo[4];
kSize = kInfo[5];
curK = blockIdx.x*blockDim.x + threadIdx.x + kStart;
if (curK > kEnd) {//check to make sure you don't overshoot the end value
return;
}
kBitLocation = 1;//assuming the first bit in the kvalue has a position 1;
for (row = 0; row < nodes; row++) {
rowCount = 0;
kMirrorBitLocation = row;//the bit position for the mirrored kvals is always starts at the row value (assuming the first row has a position of 0)
for (col = 0; col < nodes; col++) {
if (col > row) {
if (curK & (1 << (unsigned long long int)(kSize - kBitLocation))) {//add one to kIterator to convert to counting space
rowCount++;
}
kBitLocation++;
}
if (col < row) {
if (col > 0) {
kMirrorBitLocation += (nodes - 2) - (col - 1);
}
if (curK & (1 << (unsigned long long int)(kSize - kMirrorBitLocation))) {//if bit is set
rowCount++;
}
}
}
if (rowCount != edges) {
//set the ith bit to zero
values[curK - kRoot] = 0;
return;
}
}
//set the ith bit to one
values[curK - kRoot] = 1;
*solutionFound = 1; //not a race condition b/c it will only ever be set to 1 by any thread.
}
(This answer assumes output order is inconsequential and so are the positions of the valid values.)
Conceptually, your analysis produces a set of valid values. The implementation you described uses a dense representation of this set: One bit for every potential value. Yet you've indicated that the data is quite sparse (either 5e-2 or 1000/10^9 = 1e-6); moreover, copying data across PCI express is quite a pain.
Well, then, why not consider a sparse representation? The simplest one would be merely an unordered sequence of the valid values. Of course, writing that requires some synchronization across threads - perhaps even across blocks. Roughly, you can have warps collect their valid values in shared memory; then synchronize at the block level to collect the block's valid values (for a given chunk of the input it has analyzed); and finally use atomics to collect the data from all the blocks.
Oh, also - have each thread analyze multiple values, so you don't have to do that much synchronization.
So, you would want to have each thread analyze multiple numbers (thousands or millions) before you do a return from the computation. So if you analyze a million numbers in your thread, you will only need %5 of that amount of space to possible hold the results of that computation.

Different way to index threads in CUDA C

I have 9x9 matrix and i flattened it into a vector of 81 elements; then i defined a grid of 9 blocks with 9 threads each for a total of 81 threads; here's a picture of the grid
Then i tried to verify what was the index related to the the thread (0,0) of block (1,1); first i calculated the i-th column and the j-th row like this:
i = blockDim.x*blockId.x + threadIdx.x = 3*1 + 0 = 3
j = blockDim.y*blockId.y + threadIdx.y = 3*1 + 0 = 3
therefore the index is:
index = N*i + j = 9*3 +3 = 30
As a matter of fact thread (0,0) of block (1,1) is actually the 30th element of the matrix;
Now here's my problem: let's say a choose a grid with 4 blocks (0,0)(1,0)(0,1)(1,1) with 4 threads each (0,0)(1,0)(0,1)(1,1)
Let's say i keep the original vector with 81 elements; what should i do to get the index of a generic element of the vector by using just 4*4 = 16 threads? i have tried the formulas written above but they don't seem to apply.
My goal is that every thread handles a single element of the vector...
A common way to have a smaller number of threads cover a larger number of data elements is to use a "grid-striding loop". Suppose I had a vector of length n elements, and I had some smaller number of threads, and I wanted to take every element, add 1 to it, and store it back in the original vector. That code could look something like this:
__global__ void my_inc_kernel(int *data, int n){
int idx = (gridDim.x*blockDim.x)*(threadIdx.y+blockDim.y*blockIdx.y) + (threadIdx.x+blockDim.x*blockIdx.x);
while(idx < n){
data[idx]++;
idx += (gridDim.x*blockDim.x)*(gridDim.y*blockDim.y);}
}
(the above is coded in browser, not tested)
The only complicated parts above are the indexing parts. The initial calculation of idx is just a typical creation/assignment of a globally unique id (idx) to each thread in a 2D threadblock/grid structure. Let's break it down:
int idx = (gridDim.x*blockDim.x)*(threadIdx.y+blockDim.y*blockIdx.y) +
(width of grid in threads)*(thread y-index)
(threadIdx.x+blockDim.x*blockIdx.x);
(thread x-index)
The amount added to idx on each pass of the while loop is the size of the 2D grid in total threads. Therefore, each iteration of the while loop does one "grid's width" of elements at a time, and then "strides" to the next grid-width, to process the next group of elements. Let's break that down:
idx += (gridDim.x*blockDim.x)*(gridDim.y*blockDim.y);
(width of grid in threads)*(height of grid in threads)
This methodology does not require that the total number of elements be evenly divisible the number of threads. The conditional check of the while-loop handles all cases of relationship between vector size and grid size.
This particular grid-striding loop methodology has the additional benefit (in terms of mapping elements to threads) that it tends to naturally promote coalesced access. The reads and writes to data vector in the code above will coalesce perfectly, due to the behavior of the grid-striding loop. You can enhance coalescing behavior in this case by choosing blocks that are a whole-number multiple of 32, but that is not central to your question.

how to read all 1's in an Array of 1's and 0's spread-ed all over the array randomly

I have an Array with 1 and 0 spread over the array randomly.
int arr[N] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N}
Now I want to retrive all the 1's in the array as fast as possible, but the condition is I should not loose the exact position(based on index) of the array , so sorting option not valid.
So the only option left is linear searching ie O(n) , is there anything better than this.
The main problem behind linear scan is , I need to run the scan even
for X times. So I feel I need to have some kind of other datastructure
which maintains this list once the first linear scan happens, so that
I need not to run the linear scan again and again.
Let me be clear about final expectations-
I just need to find the number of 1's in a certain range of array , precisely I need to find numbers of 1's in the array within range of 40-100. So this can be random range and I need to find the counts of 1 within that range. I can't do sum and all as I need to iterate over the array over and over again because of different range requirements
I'm surprised you considered sorting as a faster alternative to linear search.
If you don't know where the ones occur, then there is no better way than linear searching. Perhaps if you used bits or char datatypes you could do some optimizations, but it depends on how you want to use this.
The best optimization that you could do on this is to overcome branch prediction. Because each value is zero or one, you can use it to advance the index of the array that is used to store the one-indices.
Simple approach:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
if( arr[i] ) indices[end++] = i; // Slow due to branch prediction
}
Without branching:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
indices[end] = i;
end += arr[i];
}
[edit] I tested the above, and found the version without branching was almost 3 times faster (4.36s versus 11.88s for 20 repeats on a randomly populated 100-million element array).
Coming back here to post results, I see you have updated your requirements. What you want is really easy with a dynamic programming approach...
All you do is create a new array that is one element larger, which stores the number of ones from the beginning of the array up to (but not including) the current index.
arr : 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1
count : 0 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 5 6 6 6 6 7
(I've offset arr above so it lines up better)
Now you can compute the number of 1s in any range in O(1) time. To compute the number of 1s between index A and B, you just do:
int num = count[B+1] - count[A];
Obviously you can still use the non-branch-prediction version to generate the counts initially. All this should give you a pretty good speedup over the naive approach of summing for every query:
int *count = new int[N+1];
int total = 0;
count[0] = 0;
for( int i = 0; i < N; i++ )
{
total += arr[i];
count[i+1] = total;
}
// to compute the ranged sum:
int range_sum( int *count, int a, int b )
{
if( b < a ) return range_sum(b,a);
return count[b+1] - count[a];
}
Well one time linear scanning is fine. Since you are looking for multiple scans across ranges of array I think that can be done in constant time. Here you go:
Scan the array and create a bitmap where key = key of array = sequence (1,2,3,4,5,6....).The value storedin bitmap would be a tuple<IsOne,cumulativeSum> where isOne is whether you have a one in there and cumulative Sum is addition of 1's as and wen you encounter them
Array = 1 1 0 0 1 0 1 1 1 0 1 0
Tuple: (1,1) (1,2) (0,2) (0,2) (1,3) (0,3) (1,4) (1,5) (1,6) (0,6) (1,7) (0,7)
CASE 1: When lower bound of cumulativeSum has a 0. Number of 1's [6,11] =
cumulativeSum at 11th position - cumulativeSum at 6th position = 7 - 3 = 4
CASE 2: When lower bound of cumulativeSum has a 1. Number of 1's [2,11] =
cumulativeSum at 11th position - cumulativeSum at 2nd position + 1 = 7-2+1 = 6
Step 1 is O(n)
Step 2 is 0(1)
Total complexity is linear no doubt but for your task where you have to work with the ranges several times the above Algorithm seems to be better if you have ample memory :)
Does it have to be a simple linear array data structure? Or can you create your own data structure which happens to have the desired properties, for which you're able to provide the required API, but whose implementation details can be hidden (encapsulated)?
If you can implement your own and if there is some guaranteed sparsity (to either 1s or 0s) then you might be able to offer better than linear performance. I see that you want to preserve (or be able to regenerate) the exact stream, so you'll have to store an array or bitmap or run-length encoding for that. (RLE will be useless if the stream is actually random rather than arbitrary but could be quite useful if there are significant sparsity or patterns with long strings of one or the other. For example a black&white raster of a bitmapped image is often a good candidate for RLE).
Let's say that your guaranteed that the stream will be sparse --- that no more than 10%, for example, of the bits will be 1s (or, conversely that more than 90% will be). If that's the case then you might model your solution on an RLE and maintain a count of all 1s (simply incremented as you set bits and decremented as you clear them). If there might be a need to quickly get the number of set bits for arbitrary ranges of these elements then instead of a single counter you can have a conveniently sized array of counters for partitions of the stream. (Conveniently-sized, in this case, means something which fits easily within memory, within your caches, or register sets, but which offers a reasonable trade off between computing a sum (all the partitions fully within the range) and the linear scan. The results for any arbitrary range is the sum of all the partitions fully enclosed by the range plus the results of linear scans for any fragments that are not aligned on your partition boundaries.
For a very, very, large stream you could even have a multi-tier "index" of partition sums --- traversing from the largest (most coarse) granularity down toward the "fragments" to either end (using the next layer of partition sums) and finishing with the linear search of only the small fragments.
Obviously such a structure represents trade offs between the complexity of building and maintaining the structure (inserting requires additional operations and, for an RLE, might be very expensive for anything other than appending/prepending) vs the expense of performing arbitrarily long linear search/increment scans.
If:
the purpose is to be able to find the number of 1s in the array at any time,
given that relatively few of the values in the array might change between one moment when you want to know the number and another moment, and
if you have to find the number of 1s in a changing array of n values m times,
... you can certainly do better than examining every cell in the array m times by using a caching strategy.
The first time you need the number of 1s, you certainly have to examine every cell, as others have pointed out. However, if you then store the number of 1s in a variable (say sum) and track changes to the array (by, for instance, requiring that all array updates occur through a specific update() function), every time a 0 is replaced in the array with a 1, the update() function can add 1 to sum and every time a 1 is replaced in the array with a 0, the update() function can subtract 1 from sum.
Thus, sum is always up-to-date after the first time that the number of 1s in the array is counted and there is no need for further counting.
(EDIT to take the updated question into account)
If the need is to return the number of 1s in a given range of the array, that can be done with a slightly more sophisticated caching strategy than the one I've just described.
You can keep a count of the 1s in each subset of the array and update the relevant subset count whenever a 0 is changed to a 1 or vice versa within that subset. Finding the total number of 1s in a given range within the array would then be a matter of adding the number of 1s in each subset that is fully contained within the range and then counting the number of 1s that are in the range but not in the subsets that have already been counted.
Depending on circumstances, it might be worthwhile to have a hierarchical arrangement in which (say) the number of 1s in the whole array is at the top of the hierarchy, the number of 1s in each 1/q th of the array is in the second level of the hierarchy, the number of 1s in each 1/(q^2) th of the array is in the third level of the hierarchy, etc. e.g. for q = 4, you would have the total number of 1s at the top, the number of 1s in each quarter of the array at the second level, the number of 1s in each sixteenth of the array at the third level, etc.
Are you using C (or derived language)? If so, can you control the encoding of your array? If, for example, you could use a bitmap to count. The nice thing about a bitmap, is that you can use a lookup table to sum the counts, though if your subrange ends aren't divisible by 8, you'll have to deal with end partial bytes specially, but the speedup will be significant.
If that's not the case, can you at least encode them as single bytes? In that case, you may be able to exploit sparseness if it exists (more specifically, the hope that there are often multi index swaths of zeros).
So for:
u8 input = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N};
You can write something like (untested):
uint countBytesBy1FromTo(u8 *input, uint start, uint stop)
{ // function for counting one byte at a time, use with range of less than 4,
// use functions below for longer ranges
// assume it's just one's and zeros, otherwise we have to test/branch
uint sum;
u8 *end = input + stop;
for (u8 *each = input + start; each < end; each++)
sum += *each;
return sum;
}
countBytesBy8FromTo(u8 *input, uint start, uint stop)
{
u64 *chunks = (u64*)(input+start);
u64 *end = chunks + ((start - stop) >> 3);
uint sum = countBytesBy1FromTo((u8*)end, 0, stop - (u8*)end);
for (; chunks < end; chunks++)
{
if (*chunks)
{
sum += countBytesBy1FromTo((u8*)chunks, 0, 8);
}
}
}
The basic trick, is exploiting the ability to cast slices of your target array to single entities your language can look at in one swoop, and test by inference if ANY of the values of it are zeros, and then skip the whole block. The more zeros, the better it will work. In the case where your large cast integer always has at least one, this approach just adds overhead. You might find that using a u32 is better for your data. Or that adding a u32 test between the 1 and 8 helps. For datasets where zeros are much more common than ones, I've used this technique to great advantage.
Why is sorting invalid? You can clone the original array, sort the clone, and count and/or mark the locations of the 1s as needed.

Algorithm for detecting duplicates in a dataset which is too large to be completely loaded into memory

Is there an optimal solution to this problem?
Describe an algorithm for finding duplicates in a file of one million phone numbers. The algorithm, when running, would only have two megabytes of memory available to it, which means you cannot load all the phone numbers into memory at once.
My 'naive' solution would be an O(n^2) solution which iterates over the values and just loads the file in chunks instead of all at once.
For i = 0 to 999,999
string currentVal = get the item at index i
for j = i+1 to 999,999
if (j - i mod fileChunkSize == 0)
load file chunk into array
if data[j] == currentVal
add currentVal to duplicateList and exit for
There must be another scenario were you can load the whole dataset in a really unique way and verify if a number is duplicated. Anyone have one?
Divide the file into M chunks, each of which is large enough to be sorted in memory. Sort them in memory.
For each set of two chunks, we will then carry out the last step of mergesort on two chunks to make one larger chunk (c_1 + c_2) (c_3 + c_4) .. (c_m-1 + c_m)
Point at the first element on c_1 and c_2 on disk, and make a new file (we'll call it c_1+2).
if c_1's pointed-to element is a smaller number than c_2's pointed-to element, copy it into c_1+2 and point to the next element of c_1.
Otherwise, copy c_2's pointed element into and point to the next element of c_2.
Repeat the previous step until both arrays are empty. You only need to use the space in memory needed to hold the two pointed-to numbers. During this process, if you encounter c_1 and c_2's pointed-to elements being equal, you have found a duplicate - you can copy it in twice and increment both pointers.
The resulting m/2 arrays can be recursively merged in the same manner- it will take log(m) of these merge steps to generate the correct array. Each number will be compared against each other number in a way that will find the duplicates.
Alternately, a quick and dirty solution as alluded to by #Evgeny Kluev is to make a bloom filter which is as large as you can reasonably fit in memory. You can then make a list of the index of each element which fails the bloom filter and loop through the file a second time in order to test these members for duplication.
I think Airza's solution is heading towards a good direction, but since sorting is not what you want, and it is more expensive you can do the following by combining with angelatlarge's approach:
Take a chunk C that fits in the memory of size M/2 .
Get the chunk Ci
Iterate through i and hash each element into a hash-table. If the element already exists then you know it is a duplicate, and you can mark it as a duplicate. (add its index into an array or something).
Get the next chunk Ci+1 and check if any of the key already exists in the hash table. If an element exists mark it for deletion.
Repeat with all chunks until you know they will not contain any duplicates from chunk Ci
Repeat steps 1,2 with chunk Ci+1
Deleted all elements marked for deletion (could be done during, whatever is more appropriate, it might be more expensive to delete one at the time if you have to shift everything else around).
This runs in O((N/M)*|C|) , where |C| is the chunk size. Notice that if M > 2N, then we only have one chunk, and this runs in O(N), which is optimal for deleting duplicates.
We simply hash them and make sure that all collisions are deleted.
Edit: Per requested, I'm providing details:
* N is the number phone numbers.
The size of the chunk will depend on the memory, it should be of size M/2.
This is the size of memory that will load a chunk of the file, since the whole file is too big to be loaded to memory.
This leaves another M/2 bytes to keep the hash table2, and/or a duplicate list1.
Hence, there should be N/(M/2) chunks, each of size |C| = M/2
The run time will be the number of chunks(N/(M/2)), times the size of each chunk |C| (or M/2). Overall, this should be linear (plus or minus the overhead of changing from one chunk to the other, which is why the best way to describe it is O( (N/M) * |C| )
a. Loading a chunk Ci. O(|C|)
b. Iterate through each element, test and set if not there O(1) will be hashed in which insertion and lookup should take.
c. If the element is already there, you can delete it.1
d. Get the next chunk, rinse and repeat (2N/M chunks, so O(N/M))
1 Removing an element might cost O(N), unless we keep a list and remove them all in one go, by avoiding to shift all the remaining elements whenever an element is removed.
2 If the phone numbers can all be represented as an integer < 232 - 1, we can avoid having a full hash-table and just use a flag map, saving piles of memory (we'll only need N-bits of memory)
Here's a somewhat detailed pseudo-code:
void DeleteDuplicate(File file, int numberOfPhones, int maxMemory)
{
//Assume each 1'000'000 number of phones that fit in 32-bits.
//Assume 2MB of memory
//Assume that arrays of bool are coalesced into 8 bools per byte instead of 1 bool per byte
int chunkSize = maxMemory / 2; // 2MB / 2 / 4-byes per int = 1MB or 256K integers
//numberOfPhones-bits. C++ vector<bool> for example would be space efficient
// Coalesced-size ~= 122KB | Non-Coalesced-size (worst-case) ~= 977KB
bool[] exists = new bool[numberOfPhones];
byte[] numberData = new byte[chunkSize];
int fileIndex = 0;
int bytesLoaded;
do //O(chunkNumber)
{
bytesLoaded = file.GetNextByes(chunkSize, /*out*/ numberData);
List<int> toRemove = new List<int>(); //we still got some 30KB-odd to spare, enough for some 6 thousand-odd duplicates that could be found
for (int ii = 0; ii < bytesLoaded; ii += 4)//O(chunkSize)
{
int phone = BytesToInt(numberData, ii);
if (exists[phone])
toRemove.push(ii);
else
exists[phone] = true;
}
for (int ii = toRemove.Length - 1; ii >= 0; --ii)
numberData.removeAt(toRemove[ii], 4);
File.Write(fileIndex, numberData);
fileIndex += bytesLoaded;
} while (bytesLoaded > 0); // while still stuff to load
}
If you can store temporary files you can load the file in chunks, sort each chunk, write it to a file, and then iterate through the chunks and look for duplicates. You can easily tell if a number is duplicated by comparing it to the next number in the file and the next number in each of the chunks. Then move to the next lowest number of all of the chunks and repeat until you run out of numbers.
Your runtime is O(n log n) due to the sorting.
I like the #airza solution, but perhaps there is another algorithm to consider: maybe one million phone numbers cannot be loaded into memory at once because they are expressed inefficiently, i.e. using more bytes per phone number than necessary. In that case, you might be able to have an efficient solution by hashing the phone numbers and storing the hashes in a (hash) table. Hash tables support dictionary operations (such as in) that let you find dupes easily.
To be more concrete about it, if each phone number is 13 bytes (such as a string in the format (NNN)NNN-NNNN), the string represents one of a billion numbers. As an integer, this can be stored in 4 bytes (instead of 13 in the string format). We then might be able to store this 4 byte "hash" in a hash table, because now our 1 billion hashed numbers take up as much space as 308 million numbers, not one billion. Ruling out impossible numbers (everything in area codes 000, 555, etc) might allow us reduce the hash size further.

Is there an efficient data structure for row and column swapping?

I have a matrix of numbers and I'd like to be able to:
Swap rows
Swap columns
If I were to use an array of pointers to rows, then I can easily switch between rows in O(1) but swapping a column is O(N) where N is the amount of rows.
I have a distinct feeling there isn't a win-win data structure that gives O(1) for both operations, though I'm not sure how to prove it. Or am I wrong?
Without having thought this entirely through:
I think your idea with the pointers to rows is the right start. Then, to be able to "swap" the column I'd just have another array with the size of number of columns and store in each field the index of the current physical position of the column.
m =
[0] -> 1 2 3
[1] -> 4 5 6
[2] -> 7 8 9
c[] {0,1,2}
Now to exchange column 1 and 2, you would just change c to {0,2,1}
When you then want to read row 1 you'd do
for (i=0; i < colcount; i++) {
print m[1][c[i]];
}
Just a random though here (no experience of how well this really works, and it's a late night without coffee):
What I'm thinking is for the internals of the matrix to be a hashtable as opposed to an array.
Every cell within the array has three pieces of information:
The row in which the cell resides
The column in which the cell resides
The value of the cell
In my mind, this is readily represented by the tuple ((i, j), v), where (i, j) denotes the position of the cell (i-th row, j-th column), and v
The would be a somewhat normal representation of a matrix. But let's astract the ideas here. Rather than i denoting the row as a position (i.e. 0 before 1 before 2 before 3 etc.), let's just consider i to be some sort of canonical identifier for it's corresponding row. Let's do the same for j. (While in the most general case, i and j could then be unrestricted, let's assume a simple case where they will remain within the ranges [0..M] and [0..N] for an M x N matrix, but don't denote the actual coordinates of a cell).
Now, we need a way to keep track of the identifier for a row, and the current index associated with the row. This clearly requires a key/value data structure, but since the number of indices is fixed (matrices don't usually grow/shrink), and only deals with integral indices, we can implement this as a fixed, one-dimensional array. For a matrix of M rows, we can have (in C):
int RowMap[M];
For the m-th row, RowMap[m] gives the identifier of the row in the current matrix.
We'll use the same thing for columns:
int ColumnMap[N];
where ColumnMap[n] is the identifier of the n-th column.
Now to get back to the hashtable I mentioned at the beginning:
Since we have complete information (the size of the matrix), we should be able to generate a perfect hashing function (without collision). Here's one possibility (for modestly-sized arrays):
int Hash(int row, int column)
{
return row * N + column;
}
If this is the hash function for the hashtable, we should get zero collisions for most sizes of arrays. This allows us to read/write data from the hashtable in O(1) time.
The cool part is interfacing the index of each row/column with the identifiers in the hashtable:
// row and column are given in the usual way, in the range [0..M] and [0..N]
// These parameters are really just used as handles to the internal row and
// column indices
int MatrixLookup(int row, int column)
{
// Get the canonical identifiers of the row and column, and hash them.
int canonicalRow = RowMap[row];
int canonicalColumn = ColumnMap[column];
int hashCode = Hash(canonicalRow, canonicalColumn);
return HashTableLookup(hashCode);
}
Now, since the interface to the matrix only uses these handles, and not the internal identifiers, a swap operation of either rows or columns corresponds to a simple change in the RowMap or ColumnMap array:
// This function simply swaps the values at
// RowMap[row1] and RowMap[row2]
void MatrixSwapRow(int row1, int row2)
{
int canonicalRow1 = RowMap[row1];
int canonicalRow2 = RowMap[row2];
RowMap[row1] = canonicalRow2
RowMap[row2] = canonicalRow1;
}
// This function simply swaps the values at
// ColumnMap[row1] and ColumnMap[row2]
void MatrixSwapColumn(int column1, int column2)
{
int canonicalColumn1 = ColumnMap[column1];
int canonicalColumn2 = ColumnMap[column2];
ColumnMap[row1] = canonicalColumn2
ColumnMap[row2] = canonicalColumn1;
}
So that should be it - a matrix with O(1) access and mutation, as well as O(1) row swapping and O(1) column swapping. Of course, even an O(1) hash access will be slower than the O(1) of array-based access, and more memory will be used, but at least there is equality between rows/columns.
I tried to be as agnostic as possible when it comes to exactly how you implement your matrix, so I wrote some C. If you'd prefer another language, I can change it (it would be best if you understood), but I think it's pretty self descriptive, though I can't ensure it's correctedness as far as C goes, since I'm actually a C++ guys trying to act like a C guy right now (and did I mention I don't have coffee?). Personally, writing in a full OO language would do it the entrie design more justice, and also give the code some beauty, but like I said, this was a quickly whipped up implementation.

Resources