For the sake of security, I probably can't post any of our files' code, but I can describe what's going on. Basically, we have standalone items and others that are composed of smaller parts. The current system we have in place works like this. Assume we have n items and m parts for each of the kits, where m is not constant and less than n in all cases.
for(all items){
    process item, record available quantity and associated costs
    write to database
    process item, get number of pre-assembled kits
    for(each part){
        determine how many are used to produce one kit
        divide total number of this specific part by number required, keep track of smallest result
        add cost of this item to total production cost of item
    }
    use smallest resulting number to determine total available quantity for this kit
    write record to database
}
At first, I wanted to say that the total time taken for this is O(n^2), but I'm not convinced that's correct given that about n/3 of all items are kits and m generally ranges between 3 and 8 parts. What would this come out to? I've tested it a few times and it feels like it's not optimized.

From the pseudo-code that you have posted it is fairly easy to work out the cost. You have a loop over n items (thus this is O(n)), and inside this loop you have another loop of O(m). As you worked out, nested loops mean that the orders are multiplied: if they were both of order n then this would give O(n^2); instead it is O(mn). If m really stays between 3 and 8 however large n gets, it is bounded by a constant and the total is effectively O(n).
This has assumed that the processing that you have mentioned runs in constant time (i.e. is independent of the size of the inputs). If those descriptions hide some other processing time then this analysis will be incorrect.
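As a rough illustration of where the O(mn) comes from, here is a minimal Python sketch of the loop described in the question. The names `kit_availability`, `process`, `recipes` and `stock` are made up for the example, the cost bookkeeping and pre-assembled kits are left out, and every step is assumed to be constant time, as above.

```python
def kit_availability(recipe, stock):
    """How many kits can be built: the scarcest part decides.

    recipe: part name -> quantity needed for one kit
    stock:  part name -> quantity on hand
    """
    return min(stock.get(part, 0) // needed for part, needed in recipe.items())


def process(items, recipes, stock):
    """Return (availability per item, number of inner-loop steps performed)."""
    available = {}
    inner_steps = 0
    for item in items:                      # outer loop: n iterations
        recipe = recipes.get(item)
        if recipe is None:                  # standalone item
            available[item] = stock.get(item, 0)
        else:                               # kit: inner loop over its m parts
            available[item] = kit_availability(recipe, stock)
            inner_steps += len(recipe)
    return available, inner_steps


stock = {"bolt": 10, "nut": 7, "plate": 3}
recipes = {"bracket": {"bolt": 2, "nut": 2, "plate": 1}}
available, inner_steps = process(["bolt", "nut", "plate", "bracket"], recipes, stock)
```

The inner-step counter grows by m for each kit, so with roughly n/3 kits and m capped at 8 the total work stays proportional to n.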


Efficiently search for pairs of numbers in various rows

Imagine you have N distinct people and that you have a record of where these people are, exactly M of these records to be exact.
For example
So you can see that 'person 1' is at the same place as 'person 50' three times. Here M = 3, obviously, since there are only 3 lines. My question is: given M of these lines and a threshold value (i.e. person A and B have been at the same place more than threshold times), what do you suggest as the most efficient way of returning these co-occurrences?
So far I've built an N by N table and looped through each row, incrementing table(i,j) every time person i co-occurs with person j in a row. Obviously this is an awful approach and takes O(n^2) to O(n^3) depending on how you implement it. Any tips would be appreciated!
There is no need to create the table. Just create a hash/dictionary/whatever your language calls it. Then in pseudocode:
answer = []
count = {}  // missing keys count as 0
for S in sets:
    for (i, j) in pairs from S:
        count[(i,j)] += 1
        if threshold == count[(i,j)]:
            answer.append((i,j))
If you have M sets of size K, the running time will be O(M*K^2).
If you want you can actually keep the list of intersecting sets in a data structure parallel to count without changing the big-O.
Furthermore the same algorithm can be readily implemented in a distributed way using a map-reduce. For the count you just have to emit a key of (i, j) and a value of 1. In the reduce you count them. Actually generating the list of sets is similar.
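For what it's worth, the counting idea above might look like this in Python. This is only a sketch: `frequent_pairs` and `records` are names invented for the example, each record is assumed to be a list of person ids, and a pair is reported once it has been seen at least `threshold` times.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(records, threshold):
    """Return every pair of people that shares at least `threshold` records."""
    count = Counter()
    answer = []
    for people in records:
        # sorted(set(...)) drops repeats within a record and gives each
        # pair a canonical (smaller, larger) order.
        for pair in combinations(sorted(set(people)), 2):
            count[pair] += 1
            if count[pair] == threshold:    # fires exactly once per pair
                answer.append(pair)
    return answer


records = [[1, 50, 3], [50, 1], [1, 2, 50], [2, 3]]
result = frequent_pairs(records, 3)
```

Only pairs that actually occur together are stored, so memory depends on the data rather than on N^2.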
The known concept for your case is Market Basket analysis. In this context, there are different algorithms. For example, the Apriori algorithm can be used for your case, as the specific case of sets of size 2.
Moreover, for finding association rules with a specific support (which in your case is the threshold value), you can also use LSH and min-hash.
You could use probability to speed it up, e.g. only check each pair with 1/50 probability. That will give you a 50x speed up. Then double check any pairs that make it close enough to 1/50th of M.
To double check any pairs, you can either go through the whole list again, or you could double check more efficiently if you do some clever kind of reverse indexing as you go. E.g. if you encode each person's row indices into 64 bit integers, you could use binary search / merge sort type techniques to see which 64 bit integers to compare, and use bit operations to compare 64 bit integers for matches. Other things to look up could be reverse indexing, binary indexed range trees / Fenwick trees.

Updating & Querying all elements in array >= X where X is variable fast

Formally we are given an array with some initial values. Then we have 3 types of Queries :-
Point updates : Increment by 1 at a given position
Range Queries : To count number of elements>=x where x is taken as input
Range Updates : To decrement by 1 all elements>=x, where x is given as input.
N = 10^5, Q = 10^5 (number of elements in array, number of queries resp.)
I tried doing this with segment Tree but operations 2,3 can be worse than O(n) even as we don't know which 'range' is to be updated exactly so we may end up traversing whole of segment tree.
NOTE: To be clear, all 3 operations need to run in logarithmic worst-case time, i.e. O(log n), because only then is this fast enough. A linear approach doesn't work: with Q = 10^5 and N = 10^5, the worst case could be O(n^2), i.e. 10^10 operations, which is clearly not feasible.
Given that you're talking about 10^5 items, and don't mention needing to add or remove items, it seems to me that the obvious data structure would be a simple sorted vector.
Operation complexities:
point update: O(1) + O(m) (where m is the number of subsequent elements equal to the value before the update).
Range query: O(log n) + O(m) (where n is start of range, m is elements in range).
Range update (same as range query).
It's a little difficult to be sure what "fast" means to you, but the fastest theoretically possible for 1 is O(1), so we're already within some constant factor of optimal.
For 2 and 3, even if we could do the find with constant complexity, we're pretty much stuck with O(m) for the update. Since log2(100000) ≈ 16.6, most of the time the O(m) term is going to dominate (i.e., the update part will involve at least as many operations as the search unless the given x falls among the last 17 or so items in the collection).
I doubt there's any point for this small of a collection, but if you might have to deal with a substantially larger collection and the items in the collection are reasonably predictably distributed, it might be worth considering doing an interpolating search instead of a binary search. With predictable distribution this reduces the expected number of comparisons to approximately O(log log n). In this case, that would be roughly 4 (but normally with a higher constant factor). This might be a win for 10^5 items, but then again it might not. If you might have to deal with a collection of (say) 10^8 items or more, it would be much more likely to be a substantial win.
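A small Python sketch of the sorted-vector idea, using the standard `bisect` module. The function names are invented for the example, and it assumes the original positions of the elements do not matter, so a point update is addressed by value rather than by index (that is a simplification, not something the question states).

```python
from bisect import bisect_left, bisect_right


def count_ge(sorted_vals, x):
    """Range query: number of elements >= x, O(log n)."""
    return len(sorted_vals) - bisect_left(sorted_vals, x)


def decrement_ge(sorted_vals, x):
    """Range update: subtract 1 from every element >= x, O(log n + m).

    The list stays sorted: everything left of `start` is <= x - 1 and
    everything from `start` on ends up >= x - 1.
    """
    start = bisect_left(sorted_vals, x)
    for i in range(start, len(sorted_vals)):
        sorted_vals[i] -= 1
    return len(sorted_vals) - start


def increment_one(sorted_vals, value):
    """Point update: add 1 to one element equal to `value`.

    Bumping the last copy of `value` keeps the list sorted.
    """
    i = bisect_right(sorted_vals, value) - 1
    if i < 0 or sorted_vals[i] != value:
        raise ValueError(f"{value} not present")
    sorted_vals[i] += 1


vals = [1, 3, 3, 5, 8]
n_ge_3 = count_ge(vals, 3)
touched = decrement_ge(vals, 3)
increment_one(vals, 2)
```

The search is logarithmic in every case; it is the loop in `decrement_ge` that contributes the O(m) term discussed above.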
The following may not be optimal, but is the best I could think of tonight.
Let's start by trying to turn the problem sideways. Instead of a map from indices to values, let's consider a map from values to sets of indices. A point update now involves removing an index from one set and adding it to another. A range update involves either simply moving an index set from one value to another or taking the union of two index sets. A range query involves folding over the sets corresponding to the values in range. A quick peek at Wikipedia suggests a traditional disjoint-set data structure is really great for set unions. Unfortunately, it's no good at all for removing an element from a set.
Fortunately, there is a newer data structure supporting union-find with constant time deletion! That takes care of both point updates and range updates quite naturally. Range queries, unfortunately, will require checking all array elements, even if very few elements are in range.

Combinations (n choose k) parallelisation and efficiency

Recently I have been working with combinations of words to make "phrases" in different languages and I have noticed a few things that I could do with some more expert input on.
Defining some constants for this,
Depths (n) is on average 6-7
The length of the input set is ~160 unique words.
Memory - Generating n permutations of 160 words wastes lots of space. I can abuse databases by writing it to disk, but then I take a hit in performance as I need to constantly wait for IO. The other trick is to generate the combinations on the fly, like a generator object.
Time - If I'm not wrong, n choose k gets big fast, something like this formula: factorial(n) / (factorial(depth) * factorial(n-depth)). This means that input sets get huge quickly.
My question is thus.
Considering I have a function f(x) that takes a combination and applies a calculation that has a cost, e.g.
func f(x) {
    if query_mysql("text search query").value > 15 {
        return true
    }
    return false
}
How can I efficiently process and execute this function on a huge set of combinations?
Bonus question, can combinations be generated concurrently?
Update: I already know how to generate them conventionally; it's more a case of making it efficient.
One approach will be to first calculate how much parallelism you can get, based on the number of threads you've got. Let the number of threads be T, and split the work as follows:
sort the elements according to some total ordering.
Find the smallest number d such that Choose(n,d) >= T.
Find all combinations of 'depth' exactly d (d is typically much lower than the full depth k, so this is computable on one core).
Now, spread the work to your T cores, each getting a set of 'prefixes' (each prefix c is a combination of size d), and for each case, find all the suffixes that their 'smallest' element is 'bigger' than max(c) according to the total ordering.
This approach also translates nicely to the map-reduce paradigm:
map(words): //one mapper
    sort(words) //by some total ordering function
    generate all combinations of depth `d` exactly // NOT K!!!
    for each combination c produced:
        idx <- index in words of max(c)
        emit(c, words[idx+1:end])
reduce(c1, words): //T reducers
    combinations <- generate all combinations of size k-d from words
    for each c2 in combinations:
        c <- concat(c1,c2)
        emit(c, f(c))
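A minimal single-process Python sketch of the prefix/suffix split may make this more concrete. The helper names are hypothetical and the combinations are expressed as indices into the sorted word list; in a real setup each prefix would be handed to a separate worker, which would call the expensive f on every combination it expands.

```python
from itertools import combinations
from math import comb


def smallest_prefix_depth(n, k, workers):
    """Smallest d with Choose(n, d) >= workers (capped at k)."""
    for d in range(1, k + 1):
        if comb(n, d) >= workers:
            return d
    return k


def expand_prefix(prefix, n, k):
    """All size-k combinations that start with `prefix`.

    Suffix elements come only from indices after max(prefix), so no
    combination is produced by two different prefixes.
    """
    start = prefix[-1] + 1          # prefix is sorted, so its last index is the max
    for suffix in combinations(range(start, n), k - len(prefix)):
        yield prefix + suffix


n, k, workers = 6, 3, 4
d = smallest_prefix_depth(n, k, workers)
all_combos = [c for p in combinations(range(n), d) for c in expand_prefix(p, n, k)]
```

Prefixes near the start of the ordering own far more suffixes than those near the end, so handing out many small prefixes per worker tends to balance the load better than one prefix each.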
Use one of the many known algorithms to generate combinations. Chase's Twiddle algorithm is one of the best known and perfectly suitable. It captures state in an array, so it can be restarted or seeded if wished.
See Algorithm to return all combinations of k elements from n for lots more.
You can progress through your list at your own pace, using minimal memory and no disk IO. Generating each combination will take a microscopic amount of time compared to the 1 sec or so of your computation.
This algorithm (like many others) is easily adapted for parallel execution if you have the necessary skills.

Limited Sort/Filter Algorithm

I have a rather large list of elements (100s of thousands).
I have a filter that can either accept or not accept elements.
I want the top 100 elements that satisfy the filter.
So far, I have sorted the results first and then taken the top 100 that satisfy the filter. The rationale behind this is that the filter is not entirely fast.
But right now, the sorting step is taking way longer than the filtering step, so I would like to combine them in some way.
Is there an algorithm to combine the concerns of sorting/filtering to get the top 100 results satisfying the filter without incurring the cost of sorting all of the elements?
My instinct is to select the top 100 elements from the list (much cheaper than a sort, use your favorite variant of QuickSelect). Run those through the filter, yielding n successes and 100-n failures. If n < 100 then repeat by selecting 100-n elements from the top of the remainder of the list:
k = 100
while (k > 0):
    select top k from list and remove them
    filter them, yielding n successes
    k = k - n
All being well this runs in time proportional to the length of the list, since each selection step runs in that time, and the number of selection steps required depends on the success rate of the filter, but not directly on the size of the list.
I expect this has some bad cases, though. If almost all elements fail the filter then it's considerably slower than just sorting everything, since you'll end up selecting thousands of times. So you might want some criteria to bail out if it's looking bad, and fall back to sorting the whole list.
It also has the problem that it will likely do a largeish number of small selects towards the end, since we expect k to decay exponentially if the filter criteria are unrelated to the sort criteria. So you could probably improve it by selecting somewhat more than k elements at each step. Say, k divided by the expected success rate of the filter, plus a small constant. Base the expectation on past performance if there's no domain knowledge you can use to predict it, and choose the small constant experimentally to avoid an annoyingly large number of steps to find the last few elements. If you end up at any step with more items that have passed the filter than the number you're still looking for (i.e., n > k), then select the top k from the current batch of successes and you're done.
Since QuickSelect gives you the top k without sorting those k, you'll need to do a final sort of 100 elements if you need the top 100 in order.
I've solved this exact problem by using a binary tree for sorting and by keeping count of the elements to the left of the current node during insertion. See (Figure 4.4 et al) for details.
If I understand right, you have two choices:
Selecting 100 Elements - N operations of the filter check. Then 100(lg 100) for the sort.
Sorting then selecting 100 Elements - At least N(lg N) for the sort, then the select.
The first sounds shorter than sorting then selecting.
I'd probably filter first, then insert the result of that into a priority queue. Keep track of the number of items in the PQ, and after you do the insert, if it's larger than the number you want to keep (100 in your case), pop off the smallest item and discard it.
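A short Python sketch of the priority-queue approach, using the standard `heapq` module. `top_k_passing` and `accept` are names invented for the example, and items are assumed to be directly comparable.

```python
import heapq


def top_k_passing(items, accept, k=100):
    """Largest k items that pass `accept`, returned in descending order."""
    heap = []                        # min-heap holding the best k seen so far
    for item in items:
        if not accept(item):
            continue
        heapq.heappush(heap, item)
        if len(heap) > k:
            heapq.heappop(heap)      # discard the smallest of the k + 1
    return sorted(heap, reverse=True)


best = top_k_passing(range(20), lambda x: x % 2 == 0, k=3)
```

This runs the filter on every element, which costs N filter calls but only O(N log k) comparisons, with no full sort.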
Steve's suggestion to use Quicksort is a good one.
1. Read in the first 1000 or so elements.
2. Sort them and pick the 100th largest element.
3. Run one pass of Quicksort on the whole file with the element from step 2 as the pivot.
4. Select the upper half of the result of the Quicksort pass for further processing.
You are guaranteed at least 100 elements in the upper half of the single pass of Quicksort. Assuming the first 1000 are reasonably representative of the whole file then you should end up with about one tenth of the original elements at step 4.
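The four steps could be sketched in Python roughly as follows. The function name and parameters are made up, and a simple list comprehension stands in for the single Quicksort partition pass; like the steps above, it ignores the filter and only narrows down the candidates.

```python
def keep_above_sampled_pivot(items, sample_size=1000, k=100):
    """Keep only elements >= the k-th largest of the first `sample_size` items."""
    sample = sorted(items[:sample_size], reverse=True)
    if len(sample) < k:
        return list(items)           # too few elements to pick a pivot
    pivot = sample[k - 1]            # k-th largest element of the sample
    # One linear pass, equivalent to one partition step around the pivot.
    return [x for x in items if x >= pivot]
```

Because at least k sampled elements are >= the pivot, the true top k of the whole list always survive the pass.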

Finding the repeated element

In an array with integers between 1 and 1,000,000 (or some much larger value), a single value occurs twice. How do you determine which one?
I think we can use a bitmap to mark the elements, and then traverse the array all over again to find the repeated element. But I think that is a process with high complexity. Is there any better way?
This sounds like homework or an interview question ... so rather than giving away the answer, here's a hint.
What calculations can you do on a range of integers whose answer you can determine ahead of time?
Once you realize the answer to this, you should be able to figure it out .... if you still can't figure it out ... (and it's not homework) I'll post the solution :)
EDIT: Ok. So here's the elegant solution ... if the list contains ALL of the integers within the range.
We know that all of the values between 1 and N must exist in the list. Using Gauss's formula we can quickly compute the expected value of a range of integers:
Sum(1..N) = 1/2 * (1 + N) * Count(1..N).
Since we know the expected sum, all we have to do is loop through all the values and sum their values. The difference between this sum and the expected sum is the duplicate value.
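In Python the sum trick comes down to a couple of lines. This sketch assumes the list holds every integer from 1 to N exactly once plus one extra copy of the duplicate, so its length is N + 1; `find_duplicate_by_sum` is just a name chosen for the example.

```python
def find_duplicate_by_sum(values):
    """Duplicate in a list containing 1..N once each plus one repeat."""
    n = len(values) - 1              # the list has N + 1 entries
    expected = n * (n + 1) // 2      # Gauss: 1 + 2 + ... + N
    return sum(values) - expected


dup = find_duplicate_by_sum([3, 1, 4, 2, 3])
```

It takes one pass and O(1) extra storage, but gives a meaningless answer if the completeness assumption does not hold.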
EDIT: As others have commented, the question doesn't state that the range contains all of the integers ... in this case, you have to decide whether you want to optimize for memory or time.
If you want to perform the operation using O(1) storage, you can perform an in-place sort of the list. As you're sorting you have to check adjacent elements. Once you see a duplicate, you know you can stop. Optimal sorting is an O(n log n) operation on average - which establishes an upper bound for finding the duplicate in this manner.
If you want to optimize for speed, you can use an additional O(n) storage. Using a HashSet (or similar structure), insert values from your list until you determine you are inserting a duplicate into the HashSet. Inserting n items into a HashSet is an O(n) operation on average, which establishes that as an upper bound for this method.
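The speed-optimized variant might be sketched like this, with Python's built-in `set` standing in for the HashSet (the function name is invented for the example):

```python
def first_duplicate(values):
    """Return the first value seen twice, or None if all values are distinct."""
    seen = set()
    for v in values:
        if v in seen:                # average O(1) membership test
            return v
        seen.add(v)
    return None


dup = first_duplicate([7, 2, 900000, 2, 5])
```

Unlike the sum trick, this makes no assumption about which integers are present.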
You may try to use bits as a hashmap:
1 at position k means that number k occurred before
0 at position k means that number k did not occur before
0. assume that your array is A
1. initialize a bitarray (there is a nice class in C# for this) of length 1000001, filled with zeros, so that index 1000000 is valid
2. for each num in A:
       if bitarray[num]
           return num
       bitarray[num] = 1
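Python has no built-in bit array, so this sketch packs the bits into a `bytearray` by hand. The function name and the `max_value` parameter are assumptions for the example, and all values are assumed to lie between 0 and `max_value`.

```python
def first_duplicate_bitmap(values, max_value=1_000_000):
    """Return the first repeated value using one bit per possible number."""
    bits = bytearray(max_value // 8 + 1)     # about 125 KB for 1,000,000
    for v in values:
        byte, mask = v >> 3, 1 << (v & 7)    # which byte, which bit in it
        if bits[byte] & mask:
            return v                         # bit already set: seen before
        bits[byte] |= mask
    return None


dup = first_duplicate_bitmap([4, 1_000_000, 17, 4, 17])
```

The memory cost depends on the size of the value range, not on the length of the array.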
The time complexity of the bitmap solution is O(n) and it doesn't seem like you could do better than that. However it will take up a lot of memory for a generic list of numbers. Sorting the numbers is an obvious way to detect duplicates and doesn't require extra space if you don't mind the current order changing.
Assuming the array is of length n < N (i.e. not ALL integers are present -- in this case LBushkin's trick is the answer to this homework problem), there is no way to solve this problem using less than O(n) memory using an algorithm that just takes a single pass through the array. This is by reduction to the set disjointness problem.
Suppose I made the problem easier, and I promised you that the duplicate elements were in the array such that the first one was in the first n/2 elements, and the second one was in the last n/2 elements. Now we can think of playing a game in which two people each hold a string of n/2 elements, and want to know how many messages they have to send to be sure that none of their elements are the same. Since the first player could simulate the run of any algorithm that takes a pass through the array, and send the contents of its memory to the second player, a lower bound on the number of messages they need to send implies a lower bound on the memory requirements of any algorithm.
But it's easy to see in this simple game that they need to send n/2 messages to be sure that they don't hold any of the same elements, which yields the lower bound.
Edit: This generalizes to show that for algorithms that make k passes through the array and use memory m, that m*k = Omega(n). And it is easy to see that you can in fact trade off memory for time in this way.
Of course, if you are willing to use algorithms that don't simply take passes through the array, you can do better as suggested already: sort the array, then take 1 pass through. This takes time O(nlogn) and space O(1). But note curiously that this proves that any sorting algorithm that uses O(1) memory and just makes passes through the array must take time Omega(n^2): by the bound above it needs Omega(n) passes, each of which costs Omega(n). Sorting algorithms that break the n^2 bound must make random accesses.
