I am trying to find the time complexity of a recursive function that merges n files.
My solution is T(n) = kc + T(n - (k+1)) for n > 0, with base case T(0) for n = 0.
Is this correct or is there any other way of finding the time complexity?
Here is the pseudo code,
//GLOBAL VARIABLES
int nRecords = 0...k;               //assume there are k records
int numFiles = 0...n;               //assume there are n files
String mainArray[0...nRecords];     //main array that stores the file records

void mergeFiles(numFiles) {         //params numFiles
    fstream file;                   //file variable
    if (numFiles == 0) {
        ofstream outfile;           //file output variable
        outfile.open(directory / mergedfile);       // point variable to directory
        for (int i = 0; i < sizeOf(mainArray); i++) {
            outfile << mainArray[i];                // write content of mainArray to outfile
        }
        outfile.close();            //close once operation is done
    } else {
        int i = 0;                  //file index counter
        file.open(directory / nextfile);            //open file to be read
        if (file.isOpen()) {
            while (!file.eof() && i < sizeOf(mainArray)) {
                file >> mainArray[i];               //copy contents of file to mainArray
                i++;                                //increase array index
            }
        }
        file.close();               //close once operation is done
        mergeFiles(numFiles - 1);   //recurse function
    }
}

int main() {
    mergeFiles(numFiles);           //call mergeFiles function in main
}
Going by your formula:
T(n) = kc + T(n-(k+1)) = kc + kc + T(n-2(k+1)) = kc + kc + ... + T(0)
     = kc * (n/(k+1)) ~ nc = O(n).
The definition of k is a bit ambiguous in your question, because the formula you provided for T(n) seems to assume you process k records per file, while the definition of mainArray in the code suggests that k represents the total number of records, not the number of records in an individual file.
I will first assume the second definition of k is the correct one, so you have:
n = number of files
k = total number of records in these files = size of array
Time complexity of read/write operations
I think you assume that the following two statements -- which read/write one record -- each run in constant time:
file >> mainArray[i];
outfile << mainArray[i];
Note that the time needed for such operations is generally dependent on the size of the record. But as you did not provide that size as something to consider, I will assume records have a constant size, and thus these operations can be considered to run in O(1), i.e. constant time.
About recursion
Although you use recursion, it is really tail recursion, so the time complexity is no different from that of an iterative algorithm. Either way, the else block is executed n times.
It is in fact not so straightforward to calculate the time complexity with a recursive formula, as you don't know how many records there are in one file, only in all files together. You can work around this, and artificially assume there are k/n records in each file, but I find it much more intuitive to perform the measurement based on the absolute number of times the else block is executed, without the need to express this in a recursive formula.
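To make that concrete, here is a minimal iterative sketch of the same control flow, with hypothetical counters in place of real file I/O (the values of n and k and the even spread of records across files are assumptions purely for illustration):

#include <iostream>
#include <vector>

// Sketch: the tail-recursive mergeFiles rewritten as a plain loop,
// with counters instead of real file I/O, to show where n and k come from.
int main() {
    const int n = 5;          // number of files (hypothetical)
    const int k = 20;         // total number of records across all files (hypothetical)
    std::vector<int> mainArray(k);

    int filesOpened = 0;      // stands in for open/close per file
    int recordsRead = 0;      // stands in for the while loop doing file >> mainArray[i]

    for (int f = 0; f < n; ++f) {           // the else branch runs n times
        ++filesOpened;
        int recordsInThisFile = k / n;      // assume records are spread evenly
        for (int r = 0; r < recordsInThisFile; ++r) {
            mainArray[recordsRead++] = r;   // one O(1) "read" per record
        }
    }

    int recordsWritten = 0;
    for (std::size_t i = 0; i < mainArray.size(); ++i) {   // the numFiles == 0 branch
        ++recordsWritten;                                  // one O(1) "write" per record
    }

    // Total work: n file opens + k reads + k writes  ->  O(n + k)
    std::cout << filesOpened << " opens, " << recordsRead << " reads, "
              << recordsWritten << " writes\n";
}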
Measurements
The body of the inner while loop can execute at most k times in total, and since you assume your files contain exactly that many records, it will execute exactly k times in total.
The final part (where numFiles == 0) has a for loop that also executes k times.
So the ingredients determining the time complexity are:
A constant time for opening/closing a file, multiplied by n
A constant time for reading/writing a record, multiplied by k
So the time complexity is O(n+k)
If definition of k is different
If k should denote the number of records in one file, then your code is wrong, as the size of the array then has to be n*k instead of k. If that is what you intended, then by similar reasoning the time complexity is O(n*k).
Note concerning the correctness of the program
In a real situation you would have to make sure the size of your array corresponds to the total number of records in your files, and not just assume that it does. If the array turns out to be smaller, you would not be able to store some records; if on the other hand the array is larger, the code for dumping the array into the output file would include array elements that were never initialised.
You would therefore be better off using a dynamically sized array (something you push onto, like a stack or vector), so that its size corresponds exactly to the number of records that have actually been read into it.
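In C++ terms, a minimal sketch of that idea could look like the following; the string records and file-name list are placeholders, not your actual code:

#include <fstream>
#include <string>
#include <vector>

// Sketch only: grow the buffer as records are read, so its size always
// matches the number of records actually stored.
std::vector<std::string> readAllRecords(const std::vector<std::string>& fileNames) {
    std::vector<std::string> records;              // replaces the fixed-size mainArray
    for (const std::string& name : fileNames) {    // one pass per input file
        std::ifstream file(name);                  // "directory/nextfile" placeholder
        std::string record;
        while (file >> record) {                   // stops cleanly at end of file
            records.push_back(record);             // size grows with the data
        }
    }
    return records;                                // exactly as many elements as records read
}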
Related
Here is a function I wrote in pseudo code:
partition(itemList) {
    numPackets = calculateNumOfPackets(listSize, packetSize);
    indexOfNextItem = 0;
    packetQueue = initialize(numPackets);
    for (i = 0; i < numPackets; i++) {
        // Initialized as a fixed-size list
        Packet p = createNew(packetSize);
        for (j = indexOfNextItem; j < itemList.length; j++) {
            // hasRoom() returns false when packet is at capacity
            if (p.hasRoom())
                // Guaranteed to run in constant time due to predefined capacity
                p.add(itemList[j]);
            else {
                indexOfNextItem = j; // keep track of next index for inner loop
                break;
            }
        } // end inner
        packetQueue.add(p);
    } // end outer
    return packetQueue;
}
As I hope is clear, this just does partitioning and returns a queue of "packets" that contains the items of the input list. I'm pretty sure this runs in linear time, because the inner loop isn't running fully for each iteration of the outer loop; it only runs until the current packet is full, at which point it caches the index where it left off and breaks out of the inner loop.
Am I understanding this correctly?
If createNew(packetSize) is linear in packetSize, initialize(numPackets) is linear in numPackets, and all of p.hasRoom(), p.add(), itemList[j] and packetQueue.add(p) are O(1), then your algorithm is O(listSize) (assuming listSize is len(itemList)).
The sketch of the proof is that each inner loop executes at most packetSize O(1) operations, and that inner loop is executed at most ceil(listSize / packetSize) times, so the total number of operations is at most (numPackets + 1) * packetSize * c, where c is a constant bounding the number of operations done in each iteration.
One of your comments states that:
Given a list of items, the algorithm is supposed to add a certain
number of these items to packets (represented as fixed-size lists) and
returns a queue of said packets. So if the input list had 100 items,
and the max packet capacity allowed is 10, then you'd get a queue of
10 packets each having 10 items.
If this is true, then, since each item is only included in 1 packet, your algorithm is linear in the number of items (O(itemList.length)) - assuming that placing items into packets is constant-time.
Counting nested loops only makes sense if the loop counters are independent. If you know that, as in this case, every item in a list is being visited once and only once, and that visit is constant-time, you can confidently state that such code is linear in the number of items.
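If it helps to see that single-pass structure explicitly, here is a rough C++ sketch of the same partitioning; the names and types are illustrative, not taken from the original code:

#include <cstddef>
#include <queue>
#include <vector>

// Sketch of the partitioning written so the linear-time argument is visible:
// the single index `next` advances over itemList exactly once.
std::queue<std::vector<int>> partition(const std::vector<int>& itemList, std::size_t packetSize) {
    std::queue<std::vector<int>> packetQueue;
    std::size_t next = 0;                              // next item to place
    while (next < itemList.size()) {
        std::vector<int> packet;
        packet.reserve(packetSize);                    // "createNew(packetSize)"
        while (packet.size() < packetSize && next < itemList.size()) {
            packet.push_back(itemList[next++]);        // each item is placed exactly once
        }
        packetQueue.push(std::move(packet));
    }
    return packetQueue;                                // ceil(n / packetSize) packets
}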
I was curious to find out: if you have a two-way iterating for loop, does it decrease the time complexity, and if so, by how much? I know most people write a standard for loop:
for (int index = 0; index < count - 1; index++)
{
    if (Something(index) == "Hello")
    {
        return true;
    }
}
return false;
How much better would it be to use a two-way iterating for loop to reduce the time?
int index2 = count - 1;
for (int index = 0; index < count - 1; index++)
{
    if (Something(index) == "Hello" || Something(index2) == "Hello")
    {
        return true;
    }
    index2--;
    if (index >= index2)
    {
        return false;
    }
}
Given no extra information about the underlying data in the array, both will actually be of the same order of complexity in terms of array lookups and comparison operations. The order of complexity is not about how many times the loop runs, but rather the total number of operations performed. The first version loops n times and does 1 operation per loop, which is n*1 = n operations in total. The second does n/2 loops with 2 operations per loop, which is (n/2)*2 = n operations. You can see that these are the same.
However when you practically implement it the second version will do worse on many architectures because of extra cache misses. If the start and end are far away you end up having to go back to main memory to load it into the cache a lot. This is much more expensive than a simple comparison. This is why compilers might optimize the code by transforming it to do something like the first form.
The time complexity is the same, since the complexity is by definition independent of any constant factor (like 2).
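As a quick illustration of the operation-count argument, here is a small self-contained sketch; Something() is a stand-in predicate (not your actual function), and the counter makes the number of lookups explicit:

#include <iostream>
#include <string>
#include <vector>

// Stand-in for Something(index); the counter makes the work explicit.
static long calls = 0;
std::string Something(const std::vector<std::string>& a, int i) { ++calls; return a[i]; }

bool oneWay(const std::vector<std::string>& a) {
    for (int i = 0; i < (int)a.size(); ++i)
        if (Something(a, i) == "Hello") return true;
    return false;
}

bool twoWay(const std::vector<std::string>& a) {
    int lo = 0, hi = (int)a.size() - 1;
    while (lo <= hi) {                        // half as many iterations...
        if (Something(a, lo) == "Hello" || Something(a, hi) == "Hello") return true;
        ++lo; --hi;                           // ...but two lookups per iteration
    }
    return false;
}

int main() {
    std::vector<std::string> data(1000, "x");   // no "Hello": worst case for both
    calls = 0; oneWay(data); std::cout << "one-way lookups: " << calls << "\n";  // 1000
    calls = 0; twoWay(data); std::cout << "two-way lookups: " << calls << "\n";  // 1000
}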
I'm looking for a data structure that roughly corresponds to (in Java terms) Map<Set<int>, double>. Essentially a set of sets of labeled marbles, where each set of marbles is associated with a scalar. I want it to be able to efficiently handle the following operations:
Add a given integer to every set.
Remove every set that contains (or does not contain) a given integer, or at least set the associated double to 0.
Union two of the maps, adding together the doubles for sets that appear in both.
Multiply all of the doubles by a given double.
Rarely, iterate over the entire map.
under the following conditions:
The integers will fall within a constrained range (between 1 and 10,000 or so); the exact range will be known at compile-time.
Most of the integers within the range (80-90%) will never be used, but which ones will not be easily determinable until the end of the calculation.
The number of integers used will almost always still be over 100.
Many of the sets will be very similar, differing only by a few elements.
It may be possible to identify certain groups of integers that frequently appear only in sequential order: for example, if a set contains the integers 27 and 29 then it (almost?) certainly contains 28 as well.
It may be possible to identify these groups prior to running the calculation.
These groups would typically have 100 or so integers.
I've considered tries, but I don't see a good way to handle the "remove every set that contains a given integer" operation.
The purpose of this data structure would be to represent discrete random variables and permit addition, multiplication, and scalar multiplication operations on them. Each of these discrete random variables would ultimately have been created by applying these operations to a fixed (at compile-time) set of independent Bernoulli random variables (i.e. each takes the value 1 or 0 with some probability).
The systems being modeled are close to being representable as a time-inhomogeneous Markov chains (which would of course simplify this immensely) but, unfortunately, it is essential to track the duration since various transitions.
Here's a data structure that can do all of your operations pretty efficiently:
I'm going to refer to it as a BitmapArray for this explanation.
Thinking about it, for just the operations you have described, a sorted array with bitmaps as keys and weights (your doubles) as values will be pretty efficient.
The bitmaps are what maintain membership in your set. Since you said the range of integers in the set are between 1-10,000, we can maintain information about any set with a bitmap of length 10,000.
It's gonna be tough sorting an array where the keys can be as big as 2^10000, but you can be smart about implementing the comparison function in the following way:
Iterate from left to right over the two bitmaps.
XOR the bits at each index.
Say you get a 1 at the i-th position.
Whichever bitmap has a 1 at the i-th position is greater.
If you never get a 1, they're equal.
I know this is still a slow comparison.
But not too slow. Here's a benchmark fiddle I did on bitmaps of length 10000.
This is in Javascript, if you're going to write in Java, it's going to perform even better.
function runTest() {
    var num = document.getElementById("txtValue").value;
    num = isNaN(num * 1) ? 0 : num * 1;
    /* For integers in the range 1-10,000 the worst case for comparison is any
       pair of equal integers, which forces the comparison to iterate over the
       whole BitArray */
    var bitmap1 = convertToBitmap(10000, num);
    var bitmap2 = convertToBitmap(10000, num);
    var before = new Date().getMilliseconds();
    var result = firstIsGreater(bitmap1, bitmap2, 10000);
    var after = new Date().getMilliseconds();
    alert(result + " in time: " + (after - before) + " ms");
}

function convertToBitmap(size, number) {
    var bits = new Array();
    var q = number;
    do {
        bits.push(q % 2);
        q = Math.floor(q / 2);
    } while (q > 0);
    var xbitArray = new Array();
    for (var i = 0; i < size; i++) {
        xbitArray.push(0);
    }
    var j = xbitArray.length - 1;
    for (var i = bits.length - 1; i >= 0; i--) {
        xbitArray[j] = bits[i];
        j--;
    }
    return xbitArray;
}

function firstIsGreater(bitArray1, bitArray2, lengthOfArrays) {
    for (var i = 0; i < lengthOfArrays; i++) {
        if (bitArray1[i] ^ bitArray2[i]) {
            if (bitArray1[i]) return true;
            else return false;
        }
    }
    return false;
}

document.getElementById("btnTest").onclick = function (e) {
    runTest();
};
Also, remember that you only have to do this once, when building your BitmapArray (or while taking unions) and then it's going to become pretty efficient for the operations you'd do most often:
Note: N is the length of the BitmapArray.
Add integer to every set: Worst/best case O(N) time. Flip a 0 to 1 in each bitmap.
Remove every set that contains a given integer: Worst case O(N) time.
For each bitmap, check the bit that represents the given integer; if it is 1, mark its index.
Compress the array by deleting all marked indices.
If you're okay with just setting the weights to 0 it'll be even more efficient. This also makes it very easy if you want to remove all sets that have any element in a given set.
Union of two maps: Worst case O(N1+N2) time. Just like merging two sorted arrays, except you have to be smart about comparisons once more.
Multiply all of the doubles by a given double: Worst/best case O(N) time. Iterate and multiply each value by the input double.
Iterate over the BitmapArray: Worst/best case O(1) time for next element.
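If it helps, here is a rough C++ sketch of the same idea, using std::vector<bool> keys in a std::map (which already provides the lexicographic "first differing bit decides" ordering described above). All names and the 10,000 bound are illustrative assumptions, not a full implementation:

#include <cstddef>
#include <map>
#include <vector>

// Rough sketch of the sorted bitmap -> weight structure described above.
const std::size_t RANGE = 10000;                 // integers 1..10000 (assumption)
using SetKey = std::vector<bool>;                // bit i set  <=>  integer i+1 is in the set
using WeightedSets = std::map<SetKey, double>;   // sorted by lexicographic bit order

SetKey emptySet() { return SetKey(RANGE, false); }

// Add a given integer to every set (keys change, so the map is rebuilt).
WeightedSets addToAll(const WeightedSets& in, int value) {
    WeightedSets out;
    for (const auto& [key, w] : in) {
        SetKey k = key;
        k[value - 1] = true;
        out[k] += w;                             // merge weights if two sets collide
    }
    return out;
}

// Remove (or alternatively just zero out) every set containing a given integer.
void removeContaining(WeightedSets& m, int value) {
    for (auto it = m.begin(); it != m.end(); ) {
        if (it->first[value - 1]) it = m.erase(it);
        else ++it;
    }
}

// Union of two maps, adding the weights of sets that appear in both.
WeightedSets unite(const WeightedSets& a, const WeightedSets& b) {
    WeightedSets out = a;
    for (const auto& [key, w] : b) out[key] += w;
    return out;
}

// Multiply all weights by a scalar.
void scale(WeightedSets& m, double factor) {
    for (auto& [key, w] : m) w *= factor;
}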
My data structure is a linked list of blocks. A block contains 31 4-byte elements and one 4-byte pointer to the next block or NULL (128 bytes per block in total). I add elements from time to time. If the last block is full, I add another block via the pointer.
One objective is to use as little memory (= as few blocks) as possible and to have no free space between two elements in a block.
This setup is fixed. All code runs on a 32-bit ARM Cortex-A8 CPU with the NEON pipeline.
Question:
How to find a specific element in that data structure as quickly as possible?
Approach (right now):
I use sorted blocks and binary search to check for an element (9 bits of the 4 bytes are the search criterion). If the desired element is not in the current block, I jump to the next block. If the element is not in the last block and the last block is not yet full, I use the result of the binary search to insert the new element (if necessary I make space using memmove within this block). Thus all blocks are always sorted.
Do you have an idea to make that faster?
This is how I search right now: (q->getPosition() is an inline function that just extracts the 9-bit position from the element via "& bitmask")
do
{
    // binary search algorithm (bsearch)
    // from http://www.google.com/codesearch/
    // p?hl=en#qoCVjtE_vOw/gcc4/trunk/gcc-
    // 4.4.3/libiberty/bsearch.c&q=bsearch&sa=N&cd=2&ct=rc
    base = &(block->points[0]);
    if (block->next == NULL)
    {
        pointsInBlock = pointsInLastBlock;
        stop = true;
    }
    else
    {
        block = block->next;
    }

    for (lim = pointsInBlock; lim != 0; lim >>= 1)
    {
        q = base + (lim >> 1);
        cmp = quantizedPosition - q->getPosition();
        if (cmp > 0)
        {
            // quantizedPosition > q: move right
            base = q + 1;
            lim--;
        }
        else if (cmp == 0)
        {
            // We found the QuantPoint
            *outQuantPoint = q;
            return true;
        }
        // else move left
    }
}
while (!stop);
Since the bulk of the time is spent in the within-block search, that needs to be as fast as possible. Since the number of elements is fixed, you can completely unroll that loop, as in:
if (key < a[16]){
    if (key < a[8]){
        ...
    }
    else { // key >= a[8] && key < a[16]
        ...
    }
}
else { // key >= a[16]
    if (key < a[24]){
        ...
    }
    else { // key >= a[24]
        ...
    }
}
Study the generated assembly language and single-step it in a debugger, to make sure the compiler's giving you good code.
You might want to write a little program to print out the above code, as it will be hard to write by hand, or possibly generate it with macros.
ADDED: Just noticed your 9-bit search criterion. In that case, just pre-allocate an array of 512 4-byte words, and index it directly. That's the fastest, and the least code.
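A minimal sketch of that direct-indexing idea, under the assumption that each 9-bit position holds at most one element and that 0 can serve as an "empty" marker:

#include <cstdint>

// Direct index over the 9-bit search key: one slot per possible position.
// 512 * 4 bytes = 2 KB of fixed storage (assumption: one element per position).
static uint32_t table[512];

void insertElement(uint32_t element, uint32_t position9bit) {
    table[position9bit & 0x1FF] = element;       // O(1) insert
}

bool findElement(uint32_t position9bit, uint32_t* outElement) {
    uint32_t e = table[position9bit & 0x1FF];    // O(1) lookup, no search loop at all
    if (e == 0) return false;                    // empty slot
    *outElement = e;
    return true;
}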
ALSO ADDED: If you need to keep your block structure, there's another way to do the unrolled binary search. It's the Jon Bentley method:
i = 0;
if (key >= a[i+16]) i += 16;
if (key >= a[i+ 8]) i += 8;
if (key >= a[i+ 4]) i += 4;
if (key >= a[i+ 2]) i += 2;
if (i < 30 && key >= a[i+ 1]) i += 1; // this excludes 31
if (key == a[i]) // then key is found
That's slower than the if-tree above, because of manipulating i, but could be substantially less code.
Let the number of elements in each block be m and the total number of blocks currently in the list be n. Then the current time complexity of your algorithm is O(n log m).
If you cannot move elements once they are added to a block, then I don't think you can do better in terms of time complexity than what you are already doing. (You could keep track of the maximum and minimum elements in a block, and skip the blocks if the element does not lie in this range. But this is not going to give you much gain. This will also waste space keeping track of the minimum and maximum for each block)
If you can afford to spend time while inserting the element and can move elements from one block to another, then here is a scheme that has time complexity O(log (mn)).
Basically, you keep all elements in sorted order. When a new element has to be inserted, binary search across block boundaries and insert it in its correct location, shifting elements to create space. This will lead to O(nm) time while inserting elements but O(log (mn)) when finding an element.
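One way to make the O(log(mn)) lookup concrete is to additionally keep a small side array of block pointers (and each block's first key), so you can binary search over blocks and then within a block. This is only a sketch under that extra-memory assumption; the element shifting across blocks at insert time is omitted:

#include <cstdint>
#include <vector>

struct Block { uint32_t points[31]; Block* next; };

struct BlockIndex {
    std::vector<Block*>   blocks;    // i-th block of the globally sorted list
    std::vector<uint32_t> firstKey;  // 9-bit key of the first element of each block
};

bool find(const BlockIndex& idx, uint32_t key, int elementsInLastBlock, uint32_t* out) {
    if (idx.blocks.empty()) return false;
    // 1) binary search over blocks: the last block whose first key is <= key
    int lo = 0, hi = (int)idx.blocks.size() - 1, b = 0;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (idx.firstKey[mid] <= key) { b = mid; lo = mid + 1; }
        else                          { hi = mid - 1; }
    }
    // 2) binary search inside that block
    const Block* blk = idx.blocks[b];
    int count = (b + 1 == (int)idx.blocks.size()) ? elementsInLastBlock : 31;
    lo = 0; hi = count - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        uint32_t pos = blk->points[mid] & 0x1FF;   // stands in for getPosition()
        if (pos == key) { *out = blk->points[mid]; return true; }
        if (pos < key) lo = mid + 1;
        else           hi = mid - 1;
    }
    return false;
}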
If this search criterion for an element is fixed, you would be better off moving the search into a separate index structure, because the maximum number of values your search criterion can distinguish is only 2^9 = 512, so the maximum size of the search index would be (2 + 4) * 512 = 3072 bytes; you could of course use something other than a static table if you needed to save memory. For now, imagine it as an array of 512 pairs <9-bit index, direct address>, which should be very fast (only one NULL check and one dereference per lookup).
Generally the answer to your question also depends on what other operations you want to perform on the structure and how frequently you perform each of them (including the search). If all you want is search(9 bits) -> add/modify/read, then your block structure would be unnecessary.
You could list them here, and maybe add what language you're using.
Edit 3:
I just noticed you can't change the blocks' size. But is your search for efficiency reasons only, or do you need the elements of the list to be unique (by those 9 bits)?
There is a file that contains 10G (1000000000) integers; please find the median of these integers. You are given 2G of memory to do this. Can anyone come up with a reasonable way? Thanks!
Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram.
Now you count up in that histogram until you reach the bin that covers the midpoint of the values.
Pass through again, ignoring all numbers that don't have that same set of top bits, and make a histogram of the bottom bits.
Count up through that histogram until you reach the bin that covers the midpoint of the (entire list of) values.
Now you know the median, in O(n) time and O(1) space (in practice, under 1 MB).
Here's some sample Scala code that does this:
def medianFinder(numbers: Iterable[Int]) = {
  def midArgMid(a: Array[Long], mid: Long) = {
    val cuml = a.scanLeft(0L)(_ + _).drop(1)
    cuml.zipWithIndex.dropWhile(_._1 < mid).head
  }
  val topHistogram = new Array[Long](65536)
  var count = 0L
  numbers.foreach(number => {
    count += 1
    topHistogram(number>>>16) += 1
  })
  val (topCount,topIndex) = midArgMid(topHistogram, (count+1)/2)
  val botHistogram = new Array[Long](65536)
  numbers.foreach(number => {
    if ((number>>>16) == topIndex) botHistogram(number & 0xFFFF) += 1
  })
  val (botCount,botIndex) =
    midArgMid(botHistogram, (count+1)/2 - (topCount-topHistogram(topIndex)))
  (topIndex<<16) + botIndex
}
and here it is working on a small set of input data:
scala> medianFinder(List(1,123,12345,1234567,123456789))
res18: Int = 12345
If you have 64 bit integers stored, you can use the same strategy in 4 passes instead.
You can use the Medians of Medians algorithm.
If the file is in text format, you may be able to fit it in memory just by converting things to integers as you read them in, since an integer stored as characters may take more space than an integer stored as an integer, depending on the size of the integers and the type of text file. EDIT: You edited your original question; I can see now that you can't read them into memory, see below.
If you can't read them into memory, this is what I came up with:
Figure out how many integers you have. You may know this from the start. If not, then it only takes one pass through the file. Let's say this is S.
Use your 2G of memory to find the x largest integers (however many you can fit). You can do one pass through the file, keeping the x largest in a sorted list of some sort, discarding the rest as you go. Now you know the x-th largest integer. You can discard all of these except for the x-th largest, which I'll call x1.
Do another pass through, finding the next x largest integers less than x1, the least of which is x2.
I think you can see where I'm going with this. After a few passes, you will have read in the (S/2)-th largest integer (you'll have to keep track of how many integers you've found), which is your median. If S is even then you'll average the two in the middle.
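A rough sketch of one such pass, using a bounded min-heap to keep the x largest values below the previous cutoff. The stream is shown as an in-memory vector only for brevity; in the real setting it would be one sequential read of the file, and duplicate values around the cutoff need a little extra bookkeeping:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// One pass of the scheme above: collect the x largest values strictly less
// than upperBound, where x is chosen so the heap fits comfortably in memory.
std::vector<int64_t> nextLargest(const std::vector<int64_t>& stream,
                                 int64_t upperBound, std::size_t x) {
    std::priority_queue<int64_t, std::vector<int64_t>,
                        std::greater<int64_t>> heap;          // min-heap of kept values
    for (int64_t v : stream) {
        if (v >= upperBound) continue;                        // counted in earlier passes
        if (heap.size() < x) heap.push(v);
        else if (v > heap.top()) { heap.pop(); heap.push(v); }
    }
    std::vector<int64_t> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;   // ascending; result.front() is the cutoff (x1, x2, ...) for the next pass
}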
Make a pass through the file and find the count of integers and the minimum and maximum values.
Take the midpoint of min and max, and get the count, min, and max for the values on either side of the midpoint, by reading through the file again.
The partition whose count exceeds half the total count contains the median.
Repeat for that partition, taking into account the size of the 'partitions to the left' (easy to maintain), and also watching for min = max.
I'm sure this would work for an arbitrary number of partitions as well.
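Here is a compact sketch of that repeated-halving idea, written as a binary search over the value range. Each while iteration corresponds to one counting pass over the file; the data is shown as an in-memory vector only for brevity. For an even count you would run it for both middle ranks and average the results:

#include <cstdint>
#include <vector>

// Sketch of bisection by value: repeatedly count how many values fall at or
// below the midpoint of the current [lo, hi] range.
// Finds the k-th smallest; for the median, k = (count + 1) / 2.
int64_t kthSmallest(const std::vector<int64_t>& data, uint64_t k) {
    int64_t lo = data[0], hi = data[0];
    for (int64_t v : data) { if (v < lo) lo = v; if (v > hi) hi = v; }   // pass: min and max
    while (lo < hi) {
        int64_t mid = lo + (hi - lo) / 2;                                // avoids overflow
        uint64_t notAbove = 0;
        for (int64_t v : data) if (v <= mid) ++notAbove;                 // one counting pass
        if (notAbove >= k) hi = mid;            // answer is <= mid
        else lo = mid + 1;                      // answer is > mid
    }
    return lo;                                  // the k-th smallest value
}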
Do an on-disk external mergesort on the file to sort the integers (counting them if that's not already known).
Once the file is sorted, seek to the middle number (odd case), or average the two middle numbers (even case) in the file to get the median.
The amount of memory used is adjustable and unaffected by the number of integers in the original file. One caveat of the external sort is that the intermediate sorting data needs to be written to disk.
Given n = number of integers in the original file:
Running time: O(n log n)
Memory: O(1), adjustable
Disk: O(n)
Check out Torben's method here: http://ndevilla.free.fr/median/median/index.html. It also has an implementation in C at the bottom of the document.
My best guess is that a probabilistic median of medians would be the fastest. Recipe:
Take the next set of N integers (N should be big enough, say 1000 or 10000 elements).
Then calculate the median of these integers and assign it to the variable X_new.
If this is not the first iteration, calculate the median of the two medians:
X_global = (X_global + X_new) / 2
When X_global no longer fluctuates much, you have found an approximate median of the data.
But there are some notes:
The question arises whether the error in the median is acceptable or not.
The integers must appear in random (uniformly shuffled) order for this solution to work.
EDIT:
I've played a bit with this algorithm and changed the idea a little: in each iteration we should blend in X_new with a decreasing weight, such as:
X_global = k*X_global + (1.-k)*X_new
where k is taken from [0.5 .. 1.] and increases with each iteration.
The point is to make the median calculation converge quickly to some number within a very small number of iterations. This way a very approximate median (with a big error) is found among 100000000 array elements in only 252 iterations. Check this C experiment:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define ARRAY_SIZE 100000000
#define RANGE_SIZE 1000

// probabilistic median of medians method
// should print 5000 as data average
// from ARRAY_SIZE of elements
int main (int argc, const char * argv[]) {
    int iter = 0;
    int X_global = 0;
    int X_new = 0;
    int i = 0;
    float dk = 0.002;
    float k = 0.5;
    srand(time(NULL));
    while (i<ARRAY_SIZE && k!=1.) {
        X_new=0;
        for (int j=i; j<i+RANGE_SIZE; j++) {
            X_new+=rand()%10000 + 1;
        }
        X_new/=RANGE_SIZE;
        if (iter>0) {
            k += dk;
            k = (k>1.)? 1.:k;
            X_global = k*X_global+(1.-k)*X_new;
        }
        else {
            X_global = X_new;
        }
        i+=RANGE_SIZE+1;
        iter++;
        printf("iter %d, median = %d \n",iter,X_global);
    }
    return 0;
}
Oops, it seems I'm talking about the mean, not the median. If that is the case and you need exactly the median, not the mean, ignore my post. In any case the mean and the median are closely related concepts.
Good luck.
Here is the algorithm described by @Rex Kerr implemented in Java.
/**
 * Computes the median.
 * @param arr Array of strings, each element represents a distinct binary number and has the same number of bits (padded with leading zeroes if necessary)
 * @return the median (number of rank ceil((m+1)/2)) of the array as a string
 */
static String computeMedian(String[] arr) {
    // rank of the median element
    int m = (int) Math.ceil((arr.length+1)/2.0);
    String bitMask = "";
    int zeroBin = 0;
    while (bitMask.length() < arr[0].length()) {
        // puts elements which conform to the bitMask into one of two buckets
        for (String curr : arr) {
            if (curr.startsWith(bitMask))
                if (curr.charAt(bitMask.length()) == '0')
                    zeroBin++;
        }
        // decides in which bucket the median is located
        if (zeroBin >= m)
            bitMask = bitMask.concat("0");
        else {
            m -= zeroBin;
            bitMask = bitMask.concat("1");
        }
        zeroBin = 0;
    }
    return bitMask;
}
Some test cases and updates to the algorithm can be found here.
I was also asked the same question and I couldn't give an exact answer, so after the interview I went through some interview books, and here is what I found in the Cracking the Coding Interview book.
Example: Numbers are randomly generated and stored into an (expanding) array. How would you keep track of the median?
Our data structure brainstorm might look like the following:
• Linked list? Probably not. Linked lists tend not to do very well with accessing and sorting numbers.
• Array? Maybe, but you already have an array. Could you somehow keep the elements sorted? That's probably expensive. Let's hold off on this and return to it if it's needed.
• Binary tree? This is possible, since binary trees do fairly well with ordering. In fact, if the binary search tree is perfectly balanced, the top might be the median. But be careful: if there's an even number of elements, the median is actually the average of the middle two elements. The middle two elements can't both be at the top. This is probably a workable algorithm, but let's come back to it.
• Heap? A heap is really good at basic ordering and keeping track of maxes and mins. This is actually interesting: if you had two heaps, you could keep track of the bigger half and the smaller half of the elements. The bigger half is kept in a min heap, such that the smallest element in the bigger half is at the root. The smaller half is kept in a max heap, such that the biggest element of the smaller half is at the root. Now, with these data structures, you have the potential median elements at the roots. If the heaps are no longer the same size, you can quickly "rebalance" the heaps by popping an element off one heap and pushing it onto the other.
Note that the more problems you do, the more developed your instinct on which data structure to apply will be. You will also develop a more finely tuned instinct as to which of these approaches is the most useful.
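To make the two-heap idea concrete, here is a minimal, self-contained C++ sketch (C++ is used only for illustration, and the values in main are arbitrary):

#include <functional>
#include <iostream>
#include <queue>
#include <vector>

// Two-heap running median, as described above.
// lower: max-heap of the smaller half; upper: min-heap of the bigger half.
class RunningMedian {
    std::priority_queue<int> lower;                                          // max-heap
    std::priority_queue<int, std::vector<int>, std::greater<int>> upper;     // min-heap
public:
    void add(int x) {
        if (lower.empty() || x <= lower.top()) lower.push(x);
        else upper.push(x);
        // rebalance so the sizes differ by at most one
        if (lower.size() > upper.size() + 1) { upper.push(lower.top()); lower.pop(); }
        else if (upper.size() > lower.size() + 1) { lower.push(upper.top()); upper.pop(); }
    }
    double median() const {
        if (lower.size() == upper.size()) return (lower.top() + upper.top()) / 2.0;
        return lower.size() > upper.size() ? lower.top() : upper.top();
    }
};

int main() {
    RunningMedian rm;
    for (int x : {5, 15, 1, 3, 8}) rm.add(x);
    std::cout << rm.median() << "\n";   // prints 5 (sorted input: 1 3 5 8 15)
}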