Randomly Consuming a Set of Elements Indexed From 1 to N - algorithm

In this problem, I have a set of elements that are indexed from 1 to n. Each element actually corresponds to a graph node, and I am trying to calculate random one-to-one matchings between the nodes. For the sake of simplicity, I will neglect further details of the actual problem. I need a fast algorithm to randomly consume these elements (nodes), and to do this multiple times in order to calculate different matchings. The purpose is to create randomized inputs for another algorithm, and each matching calculated this way will be another input to that algorithm.
The most basic algorithm I can think of is to create copies of the elements in the form of an array, generate random integers, and use them as array indices to apply swap operations. This way each random copy can be created in O(n), but in practice it uses a lot of copy and swap operations. Performance is very important, and I am looking for faster ways (algorithms and data structures) of achieving this goal. The solution just needs to satisfy two conditions:
It shall be able to consume a random element.
It shall be able to consume an element on the given index.
I tried to write this as clearly as possible. If you have any questions, feel free to ask and I will be happy to clarify. Thanks in advance.
Note: Matching is an operation where you pair the vertices on a graph if there exists an edge between them.

Shuffle an index array (for example, with a Fisher-Yates shuffle):
ia = [3,1,4,2]
Walk through the index array and "consume" the set element at the current index:
for x in ia:
    consume(Set[x])
So for this example you will get the order Set[3], Set[1], Set[4], Set[2].
No elements are swapped; only the array of integers is changed.
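For reference, here is a minimal Python sketch of this answer. The names elements and consume are placeholders for the asker's node set and consumption step (and indices are 0-based here rather than 1-based); only the index array is ever rearranged.

import random

def consume_in_random_order(elements, consume):
    # Build an index array and Fisher-Yates shuffle it in place.
    ia = list(range(len(elements)))
    for i in range(len(ia) - 1, 0, -1):
        j = random.randint(0, i)
        ia[i], ia[j] = ia[j], ia[i]
    # Walk the shuffled indices and consume the corresponding elements.
    for x in ia:
        consume(elements[x])

# usage sketch: call repeatedly to get a different random order each time
consume_in_random_order(["a", "b", "c", "d"], print)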

Related

Data structures for fast intersection operations?

Randomly select two sets; each set contains distinct keys (one key may belong to multiple sets, but a single set never contains duplicate keys).
Return an integer representing the number of keys that belong to both sets.
For example, intersect({1,2,3,4}, {3,4,5}) returns 2.
I just need the size of the intersection; I don't need to know exactly which keys are in the intersection.
Are there any data structures that support this kind of operation in less than O(n) time?
Edit:
Reading the data in does require O(n) time, but that does not lead to the conclusion that the intersection operation itself cannot be done in less than O(n) time.
Imagine this scenario:
I have N sets, each containing 100 keys. Reading them in takes N*100 operations. Now I want to know which pair of sets has the largest intersection; that takes O(N²) intersection operations. So I want to reduce the complexity of the intersection operation. I don't really care how much time it takes to read and construct the sets; it is at most N*100, which is nothing compared to O(N²) intersection operations.
Be aware that there is no way to find the pair of sets with the largest intersection using fewer than O(N²) intersection operations; I can prove that. You must do all the intersection operations.
(The basic idea: imagine a complete graph with N vertices, each representing a set, and N*(N-1)/2 edges, each representing the intersection of the connected pair. Now give each edge whatever non-negative weight you want (representing the intersection size); I can always construct N sets that satisfy those N*(N-1)/2 edge weights. That proves my claim.)
I would advise you to take a look at two possible alternatives, which work particularly well in practice (especially in the case of large sets).
1. The Bloom Filter data structure
A Bloom filter is a space-efficient probabilistic data structure (based on a bit array) that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.
There is a trade-off between the false-positive rate and the memory footprint of the Bloom filter, so it is possible to estimate an appropriate size of the Bloom filter for different use cases.
Each set can be associated with its own Bloom filter. It is very easy to obtain the Bloom filter that corresponds to the intersection of different sets: the bit arrays of the individual Bloom filters can be combined using the bitwise AND operation.
Having the Bloom filter that corresponds to the intersection, it is possible to quickly find the items that are present in all of the intersected sets.
Apart from that, it is possible to estimate the cardinality of the intersection without actually iterating over the entire sets: https://en.wikipedia.org/wiki/Bloom_filter#The_union_and_intersection_of_sets
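To make the bitwise-AND idea concrete, here is a minimal Python sketch (not a production implementation): the bit array is a Python int, the k hash functions are salted SHA-1 digests, and the sizes m and k are arbitrary illustrative defaults rather than tuned values. AND-ing two filters yields a filter that behaves like one built from the intersection, with the usual possibility of false positives.

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # k bit positions derived from salted hashes of the item
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

    def intersection(self, other):
        # bitwise AND of the bit arrays approximates the intersection's filter
        result = BloomFilter(self.m, self.k)
        result.bits = self.bits & other.bits
        return result

# usage sketch
a, b = BloomFilter(), BloomFilter()
for x in (1, 2, 3, 4):
    a.add(x)
for x in (3, 4, 5):
    b.add(x)
both = a.intersection(b)
print([x for x in range(1, 6) if both.might_contain(x)])  # expect [3, 4]; false positives possible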
2. The Skip list data structure
A Skip List is a data structure that allows fast search and intersection within an ordered sequence of elements. Fast search and intersection are made possible by maintaining a linked hierarchy of subsequences, each skipping over fewer elements.
Put succinctly, a Skip List is very similar to a plain linked list, except that each node has a couple of additional pointers to items located further along the list (pointers that "skip" over a couple of other nodes).
Hence, in order to obtain the intersection, you maintain a pointer into each of the Skip Lists being intersected. During the intersection, the pointers jump over items that are not present in all of the intersected lists, so the intersection usually runs faster than O(n).
The skip-list intersection algorithm is described in the book "Introduction to Information Retrieval" (by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze):
http://nlp.stanford.edu/IR-book/html/htmledition/faster-postings-list-intersection-via-skip-pointers-1.html
Skip Lists are actively used in a high-performance, full-featured text search engine library: Apache Lucene (Skip Lists are used inside the Inverted Index component).
Here is an additional Stackoverflow question about the usage of Skip Lists in Lucene: how lucene use skip list in inverted index?
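Here is a minimal Python sketch of the skip-pointer intersection described in the IR book, using plain sorted lists of distinct ids and an implicit skip pointer every ~sqrt(n) positions instead of a full Skip List implementation.

import math

def intersect_with_skips(p1, p2):
    # p1 and p2 are sorted lists of distinct ids
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # take the skip pointer only if it does not overshoot p2[j]
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([1, 2, 3, 4], [3, 4, 5]))  # [3, 4]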
Let's assume there is an algorithm which allows checking the intersection size in less than O(n) time. Now let's read only part of the input. We have two options:
Either we've read the whole of one set and part of the other, or we've read part of the first set and part of the second.
Option 1):
counterexample: take an input in which some element was read from set 1, is present in set 2, but hasn't been read from set 2 yet; we'll receive an incorrect result.
Option 2):
counterexample: we can have an input in which some element is in both sets but hasn't been read from at least one of them. We receive an incorrect result.
OK, we've proven that there is no such algorithm that returns the correct result when we don't read the whole input.
Let's read the whole input - n numbers. Oops, the complexity is O(n).
End of proof.

Optimized Algorithm: Fastest Way to Derive Sets

I'm writing a program for a competition and I need to be faster than all the other competitors. For this I need a little algorithm help; ideally I'd be using the fastest algorithm.
For this problem I am given 2 things. The first is a list of tuples, each of which contains exactly two elements (strings), each of which represents an item. The second is an integer, which indicates how many unique items there are in total. For example:
# of items = 3
[("ball","chair"),("ball","box"),("box","chair"),("chair","box")]
(The same tuples can be repeated; they are not necessarily unique.) My program is supposed to figure out the maximum number of tuples that can "agree" when the items are sorted into two groups. In other words, if all the items are broken into two ideal groups, group 1 and group 2, what is the maximum number of tuples that can have their first item in group 1 and their second item in group 2?
For example, the answer to my earlier example would be 2, with "ball" in group 1 and "chair" and "box" in group 2, satisfying the first two tuples. I do not necessarily need to know which items go in which group; I just need to know what the maximum number of satisfied tuples could be.
At the moment I'm trying a recursive approach, but it runs in O(n^2), which is far too inefficient in my opinion. Does anyone have a method that could produce a faster algorithm?
Thanks!
Speed up approaches for your task:
1. Use integers
Convert the strings to integers: store the strings in an array and use each string's position in the tuples (see the sketch after this list).
String[] words = {"ball", "chair", "box"};
In the tuples, "ball" now has number 0 (position 0 in the array), "chair" 1, and "box" 2.
Comparing ints is faster than comparing Strings.
2. Avoid recursion
Recursion is slow due to the call overhead.
For example, look at the binary search algorithm in a recursive implementation, then look at how Java implements Arrays.binarySearch() (with a while loop and iteration).
Recursion is helpful when a problem is so complex that a non-recursive implementation would be too complex for a human brain.
Iteration is faster, but not when you mimic recursive calls by implementing your own stack.
However, you can start with a recursive implementation; once it works and the algorithm is suitable, try converting it to a non-recursive implementation.
3. If possible, avoid objects
If you want the fastest solution, now it becomes ugly!
A tuple array can be stored either as an array of a class Point(x, y) or, probably faster, as an array of int:
Example:
(1,2), (2,3), (3,4) can be stored as array: (1,2,2,3,3,4)
This needs much less memory, because an object needs at least 12 bytes (in Java).
Less memory means faster access when the arrays are really big: the int array will hopefully fit in the processor cache, while the object array will not.
4. Programming language
In C it will be faster than in Java.
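Here is a minimal Python sketch of tips 1 and 3 combined, using the example data from the question (in Java you would use a String[] plus a flat int[], as described above; the mapping idea is the same).

words = ["ball", "chair", "box"]                 # the item strings
index = {w: i for i, w in enumerate(words)}      # "ball" -> 0, "chair" -> 1, "box" -> 2

tuples = [("ball", "chair"), ("ball", "box"), ("box", "chair"), ("chair", "box")]
flat = []                                        # pairs stored at positions 2*i, 2*i+1
for a, b in tuples:
    flat.extend((index[a], index[b]))
print(flat)                                      # [0, 1, 0, 2, 2, 1, 1, 2]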
Maximum cut is a special case of your problem, so I doubt you have a quadratic algorithm for it. (Maximum cut is NP-complete and it corresponds to the case where every tuple (A,B) also appears in reverse as (B,A) the same number of times.)
The best strategy for you to try here is "branch and bound." It's a variant of the straightforward recursive search you've probably already coded up. You keep track of the value of the best solution you've found so far. In each recursive call, you check whether it's even possible to beat the best known solution with the choices you've fixed so far.
One thing that may help (or may hurt) is to "probe": for each as-yet-unfixed item, see if putting that item on one of the two sides leads only to suboptimal solutions; if so, you know that item needs to be on the other side.
Another useful trick is to recurse on items that appear frequently both as the first element and as the second element of your tuples.
You should pay particular attention to the "bound" step --- finding an upper bound on the best possible solution given the choices you've fixed.
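As an illustration, here is a minimal Python branch-and-bound sketch along these lines. It is deliberately simple: the bound is just "pairs already satisfied plus pairs that could still be satisfied", and the probing and ordering tricks mentioned above are left out. Items are integer ids (see the integer-mapping tip above).

def max_agreeing_tuples(num_items, pairs):
    best = 0
    group = [None] * num_items                  # None = not yet assigned

    def score():
        satisfied = still_possible = 0
        for a, b in pairs:
            ga, gb = group[a], group[b]
            if ga == 1 and gb == 2:
                satisfied += 1                  # already satisfied, stays satisfied
            elif a != b and ga in (None, 1) and gb in (None, 2):
                still_possible += 1             # could still be satisfied
        return satisfied, still_possible

    def branch(item):
        nonlocal best
        satisfied, still_possible = score()
        best = max(best, satisfied)             # any completion keeps these pairs satisfied
        if item == num_items or satisfied + still_possible <= best:
            return                              # leaf reached, or bound cannot beat best
        for g in (1, 2):
            group[item] = g
            branch(item + 1)
        group[item] = None

    branch(0)
    return best

# the example from the question, with ball=0, chair=1, box=2
print(max_agreeing_tuples(3, [(0, 1), (0, 2), (2, 1), (1, 2)]))   # 2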

Data structure for range query

I was recently asked a coding question on the below problem.
I have some solutions to this problem, but I am not sure they are the most efficient.
Problem:
Write a program to track a set of text ranges. The start point and end point will be strings.
Text range example : [AbA-Ef]
Aa would fall before this range
AB would fall inside this range
etc.
String comparison would be like 'A' < 'a' < 'B' < 'b' ... 'Z' < 'z'
We need to support following operations on this range
Add range - this should merge the ranges if applicable
Delete range - this deletes range from tracked ranges and recompute the ranges
Query range - Given a character, function should return whether it is part of any of tracked ranges or not.
Note that tracked ranges can be dis-continuous.
My solutions:
I came up with two approaches.
Store the ranges as a doubly linked list, or
Store the ranges in some sort of balanced tree whose leaf nodes hold the actual data and are inter-connected as a linked list.
Do you think these solutions are good enough, or can you think of a better way of doing this so that the three APIs give the best performance?
You are probably looking for an interval tree.
Use the data structure with your custom comparator to decide what is "in range", and you will be able to do the required operations efficiently.
Note that an interval tree is actually an efficient way to implement your second idea (storing the ranges in some sort of balanced tree).
I'm not clear on what the "delete range" operation is supposed to do. Does it:
Delete a previously inserted range, and recompute the merge of the remaining ranges?
Stop tracking the deleted range, regardless of how many times parts of it have been added.
That doesn't make a huge difference algorithmically; it's just bookkeeping. But it's important to clarify. Also, are the ranges closed or half-open? (Another detail which doesn't affect the algorithm but does affect the implementation).
The basic approach to this problem is to merge the tracked set into a sorted list of disjoint (non-overlapping) ranges; either as a vector or a binary search tree, or basically any structure which supports O(log n) searching.
One approach is to put both endpoints of every disjoint range into the datastructure. To find out if a target value is in a range, find the index of the smallest endpoint greater than the target. If the index is odd the target is in some range; even means it's outside.
Alternatively, index all the disjoint ranges by their start points; find the target by searching for the largest start-point not greater than the target, and then compare the target with the associated end-point.
I usually use the first approach with sorted vectors, which are plausible if (a) space utilization is important and (b) insert and merge are relatively rare. With binary search trees, I go for the second approach. But they differ only in details and constants.
Merging and deleting are not difficult, but there are an annoying number of cases. You start by finding the ranges corresponding to the endpoints of the range to be inserted/deleted (using the standard find operation), remove all the ranges in between the two, and fiddle with the endpoints to correct the partially overlapping ranges. While the find operation is always O(log n), the tree/vector manipulation is O(n) (if the inserted/deleted range is large, anyway).
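Here is a minimal Python sketch of the first approach: all endpoints of the disjoint ranges kept in one flat sorted list, with the odd/even test for queries and a splice for add/delete. It uses half-open ranges and plain ordering; the question's custom 'A' < 'a' < 'B' character order would need a key transformation on top, which is omitted here.

import bisect

class RangeSet:
    # disjoint half-open ranges [start, end) stored as [s1, e1, s2, e2, ...]
    def __init__(self):
        self.ends = []

    def query(self, x):
        # an odd insertion index means x falls inside some [s, e)
        return bisect.bisect_right(self.ends, x) % 2 == 1

    def add(self, start, end):
        if start >= end:
            return
        i = bisect.bisect_left(self.ends, start)
        j = bisect.bisect_right(self.ends, end)
        piece = []
        if i % 2 == 0:            # start lies outside existing ranges: keep it
            piece.append(start)
        if j % 2 == 0:            # end lies outside existing ranges: keep it
            piece.append(end)
        self.ends[i:j] = piece    # swallowed endpoints are spliced out, merging ranges

    def remove(self, start, end):
        if start >= end:
            return
        i = bisect.bisect_left(self.ends, start)
        j = bisect.bisect_right(self.ends, end)
        piece = []
        if i % 2 == 1:            # start cuts an existing range: it becomes a new end
            piece.append(start)
        if j % 2 == 1:            # end cuts an existing range: it becomes a new start
            piece.append(end)
        self.ends[i:j] = piece

# usage sketch (integers for brevity)
r = RangeSet()
r.add(2, 5)
r.add(7, 9)
r.add(4, 8)                        # overlaps both: merges everything into [2, 9)
print(r.query(3), r.query(9))      # True False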
Most languages, including Java and C++, have some sort of ordered map or ordered set in which you can both look up a value and find the next value after (or the first value before) a given value. You could use this as a building block: if it contains a set of disjoint ranges, it will hold the least element of a range, followed by the greatest element of that range, followed by the least element of the next range, followed by its greatest element, and so on. When you add a range, you can check whether you have preserved this property; if not, you need to merge ranges. Similarly, you want to preserve the property when you delete. Then you can query by checking whether there is a least element just before your query point and a greatest element just after it.
If you want to create your own datastructure from scratch, I would think about some sort of radix trie structure, because this avoids doing lots of repeated string comparisons.
I think you would go for a B+ tree; it is essentially the same as your second approach.
Here are some properties of a B+ tree:
All data is stored in the leaf nodes.
Every leaf is at the same level.
All leaf nodes have links to other leaf nodes.
Here are a few applications of B+ trees:
It reduces the number of I/O operations required to find an element in the tree.
Often used in the implementation of database indexes.
The primary value of a B+ tree is in storing data for efficient retrieval in a block-oriented storage context — in particular, file systems.
NTFS uses B+ trees for directory indexing.
Basically, it helps with range-query lookups and minimizes tree traversal.

Profiling sorting algorithms against partially sorted data

We know that several sorts, such as insertion sort, are great on arrays that are 'mostly-sorted' and not so great on random data.
Suppose we wanted to profile the performance improvement/degradation of such an algorithm relative to how 'sorted' the input data is. What would be a good way to generate an 'increasingly sorted' or 'increasingly random' array of elements? How might we measure the 'sortedness' of the input?
The number of inversions is a common measure of how sorted an array is.
A pair of elements (pi, pj) in a permutation p is called an inversion if i < j and pi > pj. For example, the permutation (3,1,2,5,4) contains the 3 inversions (3,1), (3,2) and (5,4).
A sorted array has 0 inversions and a reverse-sorted array has n*(n-1)/2.
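A merge-sort based counter is a standard O(n log n) way to compute this measure; here is a minimal Python sketch.

def count_inversions(a):
    # returns (sorted copy of a, number of inversions), counted during the merge
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, inv_left = count_inversions(a[:mid])
    right, inv_right = count_inversions(a[mid:])
    merged, inversions = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            inversions += len(left) - i   # every remaining left element beats right[j]
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions

print(count_inversions([3, 1, 2, 5, 4])[1])   # 3, matching the example above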
You could generate a "partially sorted" dataset by interrupting a modern Fisher-Yates shuffle run on an already ordered dataset.
Also, if you only need a few essentially fixed sets of partially sorted data, then you could generate a column graph of position vs. value for each and just eyeball them. That would let you quickly see the general randomness of a set, as well as things like how much localised order there is.
Also look into creating a binary heap, and then using the array representation as your starting point. A binary heap implemented in an array is not sorted, but it is ordered. I think it would be considered "partially sorted."
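A minimal Python sketch of the interrupted-shuffle idea above: performing only the first k Fisher-Yates swaps on sorted data yields an array that is "mostly sorted" for small k and fully random once k reaches n - 1.

import random

def partially_shuffled(n, k):
    # start from sorted data and perform only the first k Fisher-Yates swaps
    a = list(range(n))
    for i in range(n - 1, max(n - 1 - k, 0), -1):
        j = random.randint(0, i)
        a[i], a[j] = a[j], a[i]
    return a

print(partially_shuffled(10, 0))   # fully sorted
print(partially_shuffled(10, 3))   # mostly sorted
print(partially_shuffled(10, 9))   # a complete shuffle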

Finding median of large set of numbers too big to fit into memory

I was asked this question in an interview recently.
There are N numbers, too many to fit into memory. They are split across k database tables (unsorted), each of which can fit into memory. Find the median of all the numbers.
Wasn't quite sure about the answer to this one.
There are a few potential solutions:
External merge sort - O(n log n)
You basically sort the numbers on the first pass, then find the median on the second.
Order statistics distributed selection algorithm - O(n)
Simplify the problem to the original problem of finding the kth number in an unsorted array.
Counting sort histogram - O(n)
You have to assume some properties about the range of the numbers - can the range fit in memory? (See the sketch after this answer.)
If anything is known about the distribution of the numbers, other algorithms can be produced.
For more details and implementation see:
http://www.fusu.us/2013/07/median-in-large-set-across-1000-servers.html
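To make the counting-sort/histogram idea concrete, here is a minimal Python sketch. It assumes the values are integers in a known range [lo, hi] whose histogram fits in memory, and chunks stands in for reading the k tables one at a time.

def median_from_chunks(chunks, lo, hi):
    counts = [0] * (hi - lo + 1)
    total = 0
    for chunk in chunks:             # one table at a time; only the histogram is kept
        for x in chunk:
            counts[x - lo] += 1
            total += 1
    target = (total - 1) // 2        # 0-based rank of the lower median
    seen = 0
    for offset, c in enumerate(counts):
        seen += c
        if seen > target:
            return lo + offset

print(median_from_chunks([[7, 1, 9], [4, 2, 8], [3, 6, 5]], lo=1, hi=9))   # 5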
This answer on Quora explains the whole process clearly, step by step: http://qr.ae/dMkGc. I am simply copying it down for non-Quorans.
Suppose you have a master node (or are able to use a consensus protocol to elect a master from among your servers). The master first queries the servers for the size of their sets of data, call this n, so that it knows to look for the k = n/2 largest element.
The master then selects a random server and queries it for a random element from the elements on that server. The master broadcasts this element to each server, and each server partitions its elements into those larger than or equal to the broadcasted element and those smaller than the broadcasted element.
Each server returns to the master the size of the larger-than partition, call this m. If the sum of these sizes is greater than k, the master indicates to each server to disregard the less-than set for the remainder of the algorithm. If it is less than k, then the master indicates to disregard the larger-than sets and updates k = k - m. If it is exactly k, the algorithm terminates and the value returned is the pivot selected at the beginning of the iteration.
If the algorithm does not terminate, recurse beginning with selecting a new random pivot from the remaining elements.
Analysis:
Let n be the total number of elements and s be the number of servers. Assume that the elements are roughly randomly and evenly distributed among the servers (each server has O(n/s) elements). In iteration i, we expect to do about O(n/(s*2^i)) work on each server, as the size of each server's element set will be approximately cut in half (remember, we assumed a roughly random distribution of elements), and O(s) work on the master (for broadcasting/receiving messages and adding the sizes together). We expect O(log(n/s)) iterations. Adding these up over all iterations gives an expected runtime of O(n/s + s*log(n/s)), and assuming s << sqrt(n), which is normally the case, this becomes simply O(n/s), which is the best you could possibly hope for.
Note also that this works not just for finding the median but also for finding the kth largest value for any value of k.
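Here is a minimal Python simulation of this master/server selection scheme, with plain lists standing in for servers. It uses a three-way partition around the pivot (strictly below / equal / strictly above), a small variation of the description above that avoids looping when many elements equal the pivot.

import random

def distributed_kth_smallest(servers, k):
    # finds the k-th smallest (1-based) element across all servers
    parts = [list(s) for s in servers]
    while True:
        nonempty = [p for p in parts if p]
        pivot = random.choice(random.choice(nonempty))     # the master's broadcast pivot
        below = [[x for x in p if x < pivot] for p in parts]
        above = [[x for x in p if x > pivot] for p in parts]
        n_below = sum(len(p) for p in below)
        n_above = sum(len(p) for p in above)
        n_equal = sum(len(p) for p in parts) - n_below - n_above
        if k <= n_below:
            parts = below                  # discard everything >= pivot
        elif k <= n_below + n_equal:
            return pivot                   # the pivot itself is the answer
        else:
            k -= n_below + n_equal
            parts = above                  # discard everything <= pivot

# usage sketch: the median across three "database tables"
tables = [[7, 1, 9], [4, 4, 2], [8, 3, 6, 5]]
total = sum(len(t) for t in tables)
print(distributed_kth_smallest(tables, (total + 1) // 2))   # 4 (the lower median)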
Have a look at the "Median of Medians" algorithm in this Wikipedia article.
Related question: Median-of-medians in Java.
Explanation: http://www.ics.uci.edu/~eppstein/161/960130.html
Another way to look at this is to go back to the definition of "median." Authors vary in their language, but basically the median is the value which splits a probability distribution into two equal parts.
So instead of spending a lot of effort sorting enormous data sets, estimate the distribution and find the middle. As noted above, for some distributions the median equals the mean, which is quick and easy to compute. Also, if an exact answer isn't necessary, you can use the empirical relationship: mean - mode = 3 * (mean - median).
Here is what I would do:
Sample the data to get a general idea about the distribution.
Using the information about the distribution, choose a "bucket" (a range), large enough to get the median inside and small enough to fit into the memory.
With one pass (O(N)) count the numbers before the bucket (L1_size), after the bucket (L3_size) and put numbers within the range into the bucket (L2). You will see if the chosen bucket contains the median. If not - go to step 2.
Use quickselect or another selection method to find the element with rank k = N/2 - L1_size inside the bucket (the median's rank within the bucket).
Requires O(N) + O(L2_size) steps.
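Here is a minimal Python sketch of this bucket approach. Here numbers stands in for one streaming pass over all the tables, [bucket_lo, bucket_hi] is the candidate range chosen from the sample, and heapq.nsmallest plays the role of quickselect.

import heapq

def median_with_bucket(numbers, bucket_lo, bucket_hi):
    below = above = 0
    bucket = []
    for x in numbers:                        # single O(N) pass
        if x < bucket_lo:
            below += 1
        elif x > bucket_hi:
            above += 1
        else:
            bucket.append(x)                 # only in-range values are kept in memory
    n = below + len(bucket) + above
    rank = (n + 1) // 2                      # 1-based rank of the (lower) median
    if not (below < rank <= below + len(bucket)):
        return None                          # the guessed bucket missed: re-sample and retry
    return heapq.nsmallest(rank - below, bucket)[-1]

print(median_with_bucket([9, 1, 5, 3, 7], bucket_lo=4, bucket_hi=6))   # 5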
I was also asked the same question and couldn't give an exact answer, so after the interview I went through some interview books, and here is what I found.
Example: Numbers are randomly generated and stored into an (expanding) array. How would you keep track of the median?
Our data structure brainstorm might look like the following:
• Linked list? Probably not. Linked lists tend not to do very well with accessing and sorting numbers.
• Array? Maybe, but you already have an array. Could you somehow keep the elements sorted? That's probably expensive. Let's hold off on this and return to it if it's needed.
• Binary tree? This is possible, since binary trees do fairly well with ordering. In fact, if the binary search tree is perfectly balanced, the top might be the median. But, be careful—if there's an even number of elements, the median is actually the average of the middle two elements. The middle two elements can't both be at the top. This is probably a workable algorithm, but let's come back to it.
• Heap? A heap is really good at basic ordering and keeping track of max and mins. This is actually interesting—if you had two heaps, you could keep track of the bigger half and the smaller half of the elements. The bigger half is kept in a min heap, such that the smallest element in the bigger half is at the root. The smaller half is kept in a max heap, such that the biggest element of the smaller half is at the root. Now, with these data structures, you have the potential median elements at the roots. If the heaps are no longer the same size, you can quickly "rebalance" the heaps by popping an element off the one heap and pushing it onto the other.
Note that the more problems you do, the more developed your instinct on which data structure to apply will be. You will also develop a more finely tuned instinct as to which of these approaches is the most useful.
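Here is a minimal Python sketch of the two-heap idea from the excerpt above, using heapq (a min-heap) and negated values to simulate the max-heap for the smaller half.

import heapq

class RunningMedian:
    def __init__(self):
        self.low = []     # max-heap via negation: the smaller half
        self.high = []    # min-heap: the bigger half

    def add(self, x):
        if not self.low or x <= -self.low[0]:
            heapq.heappush(self.low, -x)
        else:
            heapq.heappush(self.high, x)
        # rebalance so the heap sizes differ by at most one
        if len(self.low) > len(self.high) + 1:
            heapq.heappush(self.high, -heapq.heappop(self.low))
        elif len(self.high) > len(self.low) + 1:
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        if len(self.low) == len(self.high):
            return (-self.low[0] + self.high[0]) / 2
        return -self.low[0] if len(self.low) > len(self.high) else self.high[0]

# usage sketch
m = RunningMedian()
for x in (5, 2, 8, 1):
    m.add(x)
print(m.median())   # 3.5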
If an approximate answer is sufficient, a method similar to #piccolbo's works well. I'll assume all the points are integers, but if not, you can multiply by ten or a hundred or whatever to normalize the data to integers. Make one pass over the data calculating an average (arithmetic mean). Call that number the provisional median. Then make a second pass over the data. If the data point is less than the provisional median, reduce the provisional median by one. If the data point is greater than the provisional median, increase the provisional median by one. If the data point is the same as the provisional median, leave it unchanged. After the end of the data, return the provisional median. What will happen is that the provisional median will initially change from time to time, but eventually it will stabilize over a very small range, which will be very close to the actual median.
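And a minimal Python sketch of this provisional-median heuristic, assuming the data can be streamed twice; it returns an approximation, not the exact median.

import random

def approximate_median(numbers):
    numbers = list(numbers)                       # stands in for two passes over the data
    provisional = sum(numbers) / len(numbers)     # pass 1: the arithmetic mean
    for x in numbers:                             # pass 2: nudge by one toward the median
        if x < provisional:
            provisional -= 1
        elif x > provisional:
            provisional += 1
    return provisional

# usage sketch
data = [random.randint(0, 100) for _ in range(100_000)]
print(approximate_median(data))                   # a rough approximation of the true median (about 50 here)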
