Sampling without using random()? - algorithm

I was recently asked to implement a sampleStream() method that would choose each element with equal probability, but not use random(). I thought the interviewer was looking for reservoir sampling, but as I stumbled through it, he added that it was an approach called "stratified sampling". Admittedly, I may have been thrown off by that, because there is a statistical method called stratified sampling, and I was trying to think of how I could use that to sample elements from a stream without random. The inputs he specified were the number of items to sample, and a rate at which I should sample (something like 1000/100,000).
Anyway, I'm still stuck on this problem, even though I have already been turned down for the job for not answering it properly. Googling has failed me here. Can anyone help me understand it?

One way to implement stratified sampling is to sort the list by the keys used for stratification and then do a 1 in n sampling.
Technically, sorting isn't necessary if the keys are categories. In this (typical) case, a hashing method can be used. The idea is still the same: 1 in n sampling on an "ordered" list.
Perhaps this is what the interviewer was referring to.
EDIT:
You can implement stratified sampling on a stream. You would essentially read the stream and keep a "bucket" count for each group of similar key values. When the count reaches some arbitrary fixed value, you output the record. When the count reaches a limit (based on the overall frequencies), you reset the counter and repeat (or use modulo arithmetic).
However, this doesn't give each record an equal probability of being chosen. For that, I really do think you need some sort of randomization. An approach that comes close would be to store the records for each group in a bucket and then choose a random record when the bucket is full. You can emulate randomness by using a hash key on some other value (such as the time of insert) and then choosing the minimum or maximum hash key value. (And you can make this more efficient by storing just one record.)
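As a rough illustration of the per-group counter idea, here is a minimal Python sketch. The function name `sample_stream` and the parameters `key`, `every` and `offset` are made up for this example; it is one possible reading of the description, not necessarily what the interviewer had in mind.

```python
from collections import defaultdict

def sample_stream(records, key, every, offset=0):
    """Emit one record out of every `every` records within each stratum.

    `key` maps a record to its stratum; `offset` is the (arbitrary)
    counter value at which a record is emitted. No random() is used.
    """
    counts = defaultdict(int)  # one "bucket" counter per stratum
    for rec in records:
        k = key(rec)
        if counts[k] == offset:
            yield rec
        # modulo arithmetic resets the counter after `every` records
        counts[k] = (counts[k] + 1) % every

records = [("a", i) for i in range(6)] + [("b", i) for i in range(4)]
picked = list(sample_stream(records, key=lambda r: r[0], every=2))
```

With `every=2` this keeps every second record of each group, so the sampling rate per stratum is fixed, but which records are chosen is fully determined by their position in the stream.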


Performance comparison: Algorithm S and Algorithm Z

Recently I ran into two sampling algorithms: Algorithm S and Algorithm Z.
Suppose we want to sample n items from a data set. Let N be the size of the data set.
When N is known, we can use Algorithm S
When N is unknown, we can use Algorithm Z (optimized atop Algorithm R)
Performance of the two algorithms:
Algorithm S
Time complexity: the average number of scanned items is n(N+1)/(n+1) (I computed this result myself; Knuth's book leaves it as an exercise), so we can say it is O(N)
Space complexity: O(1), or O(n) if returning an array
Algorithm Z (I searched the web and found the paper https://www.cs.umd.edu/~samir/498/vitter.pdf)
Time complexity: O(n(1+log(N/n)))
Space complexity: TAOCP vol. 2, section 3.4.2, mentions that Algorithm R's space complexity is O(n(1+log(N/n))), so I suppose Algorithm Z might be the same
My question
The model for Algorithm Z is: keep calling the next method on the data set until we reach the end. So even when N is known, we can still use Algorithm Z.
Based on the above performance comparison, Algorithm Z has better time complexity than Algorithm S, and worse space complexity.
If space is not a problem, should we use Algorithm Z even when N is known?
Is my understanding correct? Thanks!
Is the Postgres code mentioned in your comment actually used in production? In my opinion, it really should be reviewed by someone who has at least some understanding of the problem domain. The problem with random sampling algorithms, and random algorithms in general, is that it is very hard to diagnose biased sampling bugs. Most samples "look random" if you don't look too hard, and biased sampling is only obvious when you do a biased sample of a biased dataset. Or when your biased sample results in a prediction which is catastrophically divergent from reality, which will eventually happen but maybe not when you're doing the code review.
Anyway, by way of trying to answer the questions, both the one actually in the text of this post and the ones added or implied in the comment stream:
Properly implemented, Vitter's algorithm Z is much faster than Knuth's algorithm S. If you have a use case in which reservoir sampling is indicated, then you should probably use Vitter, subject to the code testing advice above: Vitter's algorithm is more complicated and it might not be obvious how to validate the implementation.
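For reference, selection sampling in the style of Knuth's Algorithm S can be sketched in a few lines of Python. This is a simplified illustration that assumes the data is an in-memory sequence of known length; the function name and the `rng` parameter are chosen for this example only.

```python
import random

def algorithm_s(items, n, rng=random):
    """Select n of the N items in one ordered pass (N must be known)."""
    N = len(items)
    sample = []
    for t, item in enumerate(items):  # t = number of items already scanned
        needed = n - len(sample)
        if needed == 0:
            break  # sample complete: the rest need not be scanned
        # select with probability (still needed) / (still unscanned)
        if rng.random() * (N - t) < needed:
            sample.append(item)
    return sample

picked = algorithm_s(list(range(100)), 10, random.Random(1))
```

The early `break` is why the average number of scanned items is below N, but the scan is still linear in N, which is the cost Algorithm Z avoids by computing how many items to skip.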
I noticed in the Postgres code that it just uses the threshold value of 22 to decide whether to use the more complicated code, based on testing done almost 40 years ago on hardware which you'd be hard pressed to find today. It's possible that 22 is not a bad threshold, but it's just a number pulled out of thin air. At least some attempt should be made to verify or, more likely, correct it.
Forty years ago, when those algorithms were developed, large datasets were typically stored on magnetic tape. Magnetic tape is still used today, but applications have changed; I think that you're not likely to find a Postgres installation in which a live database is stored on tape. This matters because the way you get data off a tape drive is radically different from the way you get data from a file server. Or a sharded distributed collection of file servers, which also has its particular needs.
Data on a reel of tape can only be accessed linearly, although it is possible to skip tape somewhat faster than you can read it. On a file server, data is random access; there may be a slight penalty for jumping around in a file, but there might not. (On the sharded distributed model, it might well be faster than linear reads.) But trying to read out of order on a tape drive might turn an input operation which takes an hour into an operation which takes a week. So it's very important to access the sample in order. Moreover, you really don't want to have to read the tape twice, which would take twice as long.
One of the other assumptions that was made in those algorithms is that you might not have enough memory to store the entire sample; in 1985, main memory was horribly expensive and databases were already quite large. So a common way to collect a large sample from a huge database was to copy the sampled blocks onto secondary memory, such as another tape drive. But there's a bit of a catch with reservoir sampling: as the sampling algorithm proceeds, some items which were initially inserted in the sample are later replaced with other items. But you can't replace data written on tape, so you need to just keep on appending the newly selected samples. What you do hold in random access memory is a list of locations of the sample; once you've finished selecting the sample, you can sort this list of locations and then use it to read out the final selection in storage order, skipping over the rejected items. That means that the temporary sample storage ends up holding both the final sample, and some number of later rejected items. The O(n(1+log(N/n))) space complexity in Algorithm R refers to precisely this storage, and it's actually a reasonably small multiplier, considering.
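The replacement behaviour described here is easiest to see in the in-memory form of reservoir sampling (Algorithm R). The following Python sketch assumes the whole reservoir fits in RAM, so it overwrites slots directly instead of appending to secondary storage; the names are illustrative.

```python
import random

def algorithm_r(stream, n, rng=random):
    """Keep a uniform sample of n items from a stream of unknown length."""
    reservoir = []
    for t, item in enumerate(stream):
        if t < n:
            reservoir.append(item)  # the first n items fill the reservoir
        else:
            j = rng.randrange(t + 1)  # uniform integer in [0, t]
            if j < n:
                reservoir[j] = item  # replaces an earlier selection
    return reservoir

picked = algorithm_r(iter(range(1000)), 5, random.Random(0))
```

Every `reservoir[j] = item` is a replacement that, on tape, would instead be an append plus an update to the in-memory list of locations.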
All that is irrelevant if you can just allocate enough random access storage somewhere to hold the entire sample. Or, even better, if you can read the data directly from the database. There could well still be good reasons to read the sample into local storage, but nothing stops you from updating a block of local storage with a different block.
On the other hand, in many common cases, you don't need to read the data in order to sample it. You can just take a list of item numbers, select a sample of the desired size from that list, and then set about acquiring the sample from the list of selected item numbers. And that presents a rather different problem: how to choose an unbiased sample of size k from a set of K item indexes.
There's a fast and simple solution to that (also described by Knuth, unsurprisingly): make an array of all the item numbers (say, the integers from 0 to K-1), and then shuffle the array using the standard Knuth/Fisher-Yates shuffle, with a slight modification: you run the algorithm from front to back (instead of back to front, as it is often presented), and stop after k iterations. At that point the first k elements in the partially shuffled array are an unbiased sample. (In fact, you don't need the entire vector of K indices, as long as k is much smaller than K. You're only going to touch O(k) of the values, and you can keep the ones you touched in a hash table of size O(k).)
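A sketch of the partial front-to-back shuffle with the hash-table trick might look like this in Python. The dict `swapped` stands in for the array: any position not present in it is assumed to still hold its own index. The function name and `rng` parameter are assumptions made for this example.

```python
import random

def sample_indices(K, k, rng=random):
    """Return k distinct indexes from range(K) via a partial Fisher-Yates shuffle."""
    swapped = {}  # sparse array: position -> value (default: the position itself)
    out = []
    for i in range(k):
        j = rng.randrange(i, K)  # choose from the not-yet-fixed tail [i, K)
        vi = swapped.get(i, i)
        vj = swapped.get(j, j)
        swapped[j] = vi  # swap: position j receives the old value of position i
        out.append(vj)   # position i receives vj and is never touched again
    return out

picked = sample_indices(10**9, 5, random.Random(42))
```

Only O(k) dictionary entries are ever created, so sampling 5 indexes out of a billion never materialises the billion-element array. An index, once appended to `out`, is never removed.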
And there's an even simpler algorithm, again for the case where the sample is small relative to the dataset: just keep one bit for each item in the dataset, which indicates that the item has been selected. Now select k items at random, marking the bit vector as you go; if the relevant bit is already marked, then that item is already in the sample; you just ignore that selection and continue with the next random choice. The expected number of ignored samples is very small unless the sample size is a significant fraction of the dataset size.
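The bit-vector approach can be sketched as follows, using a Python `bytearray` as the bitmap. This is an illustrative sketch that assumes k is no larger than K (otherwise the loop would never finish); the names are made up for the example.

```python
import random

def sample_by_marking(K, k, rng=random):
    """Pick k distinct indexes from range(K), retrying on repeats."""
    marked = bytearray((K + 7) // 8)  # one bit per item, all initially clear
    chosen = []
    while len(chosen) < k:
        i = rng.randrange(K)
        byte, bit = divmod(i, 8)
        if marked[byte] & (1 << bit):
            continue  # already in the sample: ignore and draw again
        marked[byte] |= 1 << bit
        chosen.append(i)
    return chosen

picked = sample_by_marking(1000, 20, random.Random(7))
```

As with the partial shuffle, an item that has been chosen is never rejected later, so each pick can be processed immediately.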
There's one other criterion which weighed on the minds of Vitter and Knuth: you'll normally want to do something with the selected sample. And given the amount of time it takes to read through a tape, you want to be able to start processing each item immediately as it is accepted. That precludes algorithms which include, for example, "sort the selected indices and then read the indicated items". (See above.) For immediate processing to be possible, you must not depend on being able to "deselect" already selected items.
Fortunately, both the quick algorithms mentioned at the end of point 2 do satisfy this requirement. In both cases, an item once selected will never be later rejected.
There is at least one use case for reservoir sampling which is still very much relevant: sampling a datastream which is too voluminous or too high-bandwidth to store. That might be some kind of massive social media feed, or it might be telemetry data from a large sensor array, or whatever. In that case, you might want to reduce the size of the datastream by extracting only a small sample, and reservoir sampling is a good candidate. However, that has nothing to do with the Postgres example.
In summary:
Yes, you can (and probably should) use Vitter's Algorithm Z in preference to Knuth's Algorithm S, even if you know how big the data set is.
But there are certainly better algorithms, some of which are outlined above.

Is there a specific scenario of a hash table that isn't full yet an insertion can't occur?

What I mean to ask is for a hash-table following the standard size of a prime number, is it possible to have some scenario (of inserted keys) where no further insertion of a given element is possible even though there's some empty slots? What kind of hash-function would achieve that?
So, most hash functions allow for collisions ("Hash Collisions" is the phrase you should google to understand this better, by the way.) Collisions are handled by having a secondary data structure, like a list, to store all of the values inserted at keys with the same hash.
Because these data structures can generally store arbitrarily many elements, you will always be able to insert into the hash table, but the performance will get worse and worse, approaching the performance of the backing data structure.
If you do not have a backing data structure, then you can be unable to insert as soon as two things get added to the same position. Since a good hash function distributes things evenly and effectively randomly, this would happen pretty quickly (see "The Birthday Problem").
There are failure-to-insert scenarios for some but not all hash table implementations.
For example, closed hashing aka open addressing implementations use some logic to create a sequence of buckets in which they'll "probe" for values not found at the hashed-to bucket due to collisions. In the real world, sometimes the sequence-creation is pretty basic, for example:
the programmer might have hard-coded N prime numbers, thinking the odds of adding in each of those in turn and still not finding an empty bucket are low (but a malicious user who knows the hash table design may be able to calculate values to make the table fail, or it may simply be so full that the odds are no longer good, or - while emptier - a statistical freak event)
the programmer might have done something like picked a prime number they liked - say 13903 - to add to the last-probed bucket each time until a free one is found, but if the table size happens to be 13903 too it'll keep checking the same bucket.
Still, there are probing approaches such as linear probing that guarantee to try all buckets (unless the implementation goes out of its way to put a limit on retries). It has some other "issues" though, and won't always be the best choice.
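To make the second failure mode concrete, here is a small Python sketch that lists the buckets visited by "add a constant step" probing. The helper `probe_sequence` is hypothetical, written only for this illustration.

```python
def probe_sequence(start, step, size, limit):
    """Return the first `limit` bucket indexes visited by constant-step probing."""
    return [(start + i * step) % size for i in range(limit)]

# Step equal to the table size: every probe lands on the same bucket.
stuck = probe_sequence(5, 13903, 13903, 4)

# Linear probing (step 1) walks through every bucket of a 7-slot table.
linear = probe_sequence(5, 1, 7, 7)

# Any step that shares no common factor with the table size also covers all buckets.
coprime = probe_sequence(0, 3, 7, 7)
```

In general a constant step visits every bucket exactly when the step and the table size are coprime, which is one reason prime table sizes are popular: any step that is not a multiple of the size then works.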
If a hash table is implemented using open addressing instead of separate chaining, then it is a good idea to leave at least 1 slot empty to simplify the algorithm.
In open addressing when we are trying to find an element, we first compute the hash index i, then check the table at indexes {i, i + 1, i + 2, ... N - 1, (wrapping around) 0, 1, 2, ...}, until we either find the element we want or hit an empty slot. You can see that in this algorithm, if no slot is empty but the element can't be found, then the search would loop forever.
However, I should emphasize that enforcing an empty slot merely simplifies the search algorithm. Alternatively, the search algorithm can remember the starting index i, and halt the search if the entire table has been scanned and it lands back at index i.
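A minimal Python sketch of a linear-probing search that remembers its starting index is shown below. It assumes the table is a plain list in which `None` marks an empty slot and that deletions (tombstones) are not supported; the function name is illustrative.

```python
EMPTY = None  # marker for an unused slot (assumes None is never stored as a key)

def lookup(table, key):
    """Return the index of `key` in an open-addressing table, or None if absent."""
    n = len(table)
    start = hash(key) % n
    i = start
    while True:
        if table[i] is EMPTY:
            return None  # reached an empty slot: the key is not present
        if table[i] == key:
            return i
        i = (i + 1) % n  # wrap around at the end of the table
        if i == start:
            return None  # scanned the whole (full) table without finding it

full_table = [10, 1, 2, 3, 4]  # no empty slot at all
found = lookup(full_table, 10)
missing = lookup(full_table, 7)
```

The `i == start` check is what keeps the search from looping forever on a completely full table.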

Iterative Hash Algorithm for Fast File Check

I want to create a representation of the state of all files in a folder (ignoring order), so that I can send this state to another computer to check if we are in sync. This "state representation" is 3 numbers concatenated by . which are:
sum . product . number of items
The "sum" is the numerical addition of all of the files' md5 numerical representations.
The product is the multiplication of all of the file's md5 numerical representations.
The number of items is just the number of files.
The main reason for doing this is that this allows me to create unique states iteratively/quickly when I add or delete a file (a modification being a combination of delete then add). Also, one should end up with the same "state" even if the same set of operations are performed in any random order.
Adding A File
Generate the file's md5
Calculate the md5's numerical value (x).
Add x to the sum
Multiply the product by x
Increment the number of items.
Removing A File
Generate the file's md5
Calculate the md5's numerical value (x).
Subtract x from the sum
Divide the product by x
Decrement the number of items.
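The add and remove steps above can be sketched in Python, whose built-in integers are arbitrary precision, so no string-based arithmetic library is needed. The class name `FolderState` and its methods are made up for this illustration, and file contents are passed in as bytes rather than read from disk.

```python
import hashlib

class FolderState:
    """Order-independent state: sum, product and count of the files' md5 values."""

    def __init__(self):
        self.total = 0
        self.product = 1
        self.count = 0

    @staticmethod
    def _value(data: bytes) -> int:
        # numerical value of the md5 digest (a 128-bit integer)
        return int(hashlib.md5(data).hexdigest(), 16)

    def add(self, data: bytes) -> None:
        x = self._value(data)
        self.total += x
        self.product *= x
        self.count += 1

    def remove(self, data: bytes) -> None:
        x = self._value(data)
        self.total -= x
        self.product //= x  # exact, because x was multiplied in earlier
        self.count -= 1

    def state(self) -> str:
        return f"{self.total}.{self.product}.{self.count}"

a = FolderState()
for content in (b"x", b"y", b"z"):
    a.add(content)
b = FolderState()
for content in (b"z", b"x", b"y"):
    b.add(content)
```

Note that the product grows by roughly 128 bits per file, so the state string gets long for large folders, and a digest whose numerical value is 0 (astronomically unlikely) would break the division step.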
Problems
Since the numerical representations of hashes can be quite large, I may have to use a library to generate results using strings rather than integers which may be quite slow.
With the limited testing I have done, I have not been able to create "collisions" where a collision is where two different sets of file hashes could produce the same state (remember that we are ignoring the order of the file hashes).
Question
I'm sure that I can't be the first person to want to achieve such a thing. Is there an algorithm or iterative hash function that aims to do the same thing already, preferably in PHP, Java, or Python? Is there a term for this type of thing, all I could think of was "iterative hash"? Is there a flaw with this algorithm that I haven't spotted, such as with "collisions" making generated state representations non-unique?
How many states can your file system take? Infinitely many, for all practical purposes.
How long is your hash? Short enough to be efficient, and finite in any case.
Will I get collisions? Yes.
So your hash approach seems fine, particularly if it spreads nearby points well, i.e. two file system states that differ in the content of just one file hash to very different values.
However, you should expect your hash to produce collisions in the long run: as long as the collision chance is not 0, the probability that you eventually hit one goes to one.
So to be really safe, you probably need a full MD5 exchange. If speed and fast updates are the goal, your scheme sounds good, but I would back it up with less frequent exchanges of longer keys, just to be on the safe side if sync is mission critical.

What is the best way to analyse a large dataset with similar records?

Currently I am looking for a way to develop an algorithm to analyse a large dataset (about 600M records). The records have the parameters "calling party", "called party" and "call duration", and I would like to create a graph of weighted connections among phone users.
The whole dataset consists of similar records: people mostly talk to their friends and don't dial random numbers, but occasionally a person calls "random" numbers as well. For analysing the records I was thinking about the following logic:
create an array of numbers to indicate which records (row numbers) have already been scanned.
start scanning from the first line and for the first line combination "calling party", "called party" check for the same combinations in the database
sum the call durations and divide the result by the sum of all call durations
add the numbers of summed lines into the array created at the beginning
check the array if the next record number has already been summed
if it has already been summed then skip the record, else perform step 2
I would appreciate if anyone of you suggested any improvement of the logic described above.
p.s. the edges are directed therefore the (calling party, called party) is not equal to (called party, calling party)
Although this is not programming related, I would like to emphasize that, due to law and respect for user privacy, all information that could possibly reveal a user's identity has been hashed before the analysis.
As always with large datasets the more information you have about the distribution of values in them the better you can tailor an algorithm. For example, if you knew that there were only, say, 1000 different telephone numbers to consider you could create a 1000x1000 array into which to write your statistics.
Your first step should be to analyse the distribution(s) of data in your dataset.
In the absence of any further information about your data I'm inclined to suggest that you create a hash table. Read each record in your 600M dataset and calculate a hash address from the concatenation of calling and called numbers. Into the table at that address write the calling and called numbers (you'll need them later, and bear in mind that the hash is probably irreversible), add 1 to the number of calls and add the duration to the total duration. Repeat 600M times.
Now you have a hash table which contains the data you want.
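In Python the hash table described above is simply a dict keyed by the (calling, called) pair. The following is a small in-memory sketch; the function name and the record layout (caller, callee, duration) are assumptions for this example, and 600M real records would need to be streamed from disk rather than held in a list.

```python
from collections import defaultdict

def aggregate_calls(records):
    """Map (caller, callee) -> [number of calls, total duration]."""
    stats = defaultdict(lambda: [0, 0])
    for caller, callee, duration in records:
        entry = stats[(caller, callee)]  # the key keeps both numbers for later use
        entry[0] += 1
        entry[1] += duration
    return dict(stats)

calls = [("a", "b", 10), ("a", "b", 5), ("b", "a", 3)]
edges = aggregate_calls(calls)
```

Because the key is the ordered pair, ("a", "b") and ("b", "a") stay separate, which matches the directed edges in the question.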
Since there are 600 M records, it seems to be large enough to leverage a database (and not too large to require a distributed Database). So, you could simply load this into a DB (MySQL, SQLServer, Oracle, etc) and run the following queries:
select calling_party, called_party, sum(call_duration), avg(call_duration), min(call_duration), max (call_duration), count(*) from call_log group by calling_party, called_party order by 7 desc
That would be a start.
Next, you would want to run some Association analysis (possibly using Weka), or perhaps you would want to analyze this information as cubes (possibly using Mondrian/OLAP). If you tell us more, we can help you more.
Algorithmically, what the DB is doing internally is similar to what you would do yourself programmatically:
Scan each record
Find the record for each (calling_party, called_party) combination, and update its stats.
A good way to store and find records for (calling_party, called_party) would be to use a hashfunction and to find the matching record from the bucket.
Although it may be tempting to create a two-dimensional array for (calling_party, called_party), that would be a very sparse array (very wasteful).
How often will you need to perform this analysis? If this is a large, unique dataset and thus only once or twice - don't worry too much about the performance, just get it done, e.g. as Amrinder Arora says by using simple, existing tooling you happen to know.
You really want more information about the distribution, as High Performance Mark says. For starters, it'd be nice to know the count of unique phone numbers, the count of unique phone number pairs, and the mean, variance and maximum of the count of calling/called phone numbers per unique phone number.
You really want more information about the analysis you want to perform on the result. For instance, are you more interested in holistic statistics or in identifying individual clusters? Do you care more about following the links forward (determining who X frequently called) or following the links backward (determining who X was frequently called by)? Do you want to project overviews of this graph into low-dimensional spaces, i.e. 2D? Should it be easy to identify indirect links, e.g. X is near {A, B, C}, all of whom are near Y, so X is sort of near Y?
If you want fast and frequently adapted results, then be aware that a dense representation with good memory & temporal locality can easily make a huge difference in performance. In particular, that can easily outweigh a factor ln N in big-O notation; you may benefit from a dense, sorted representation over a hashtable. And databases? Those are really slow. Don't touch those if you can avoid it at all; they are likely to be a factor 10000 slower - or more, the more complex the queries are you want to perform on the result.
Just sort records by "calling party" and then by "called party". That way each unique pair will have all its occurrences in consecutive positions. Hence, you can calculate the weight of each pair (calling party, called party) in one pass with little extra memory.
For sorting, you can sort small chunks separately, and then do an N-way merge sort. That's memory efficient and can be easily parallelized.
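The sort-then-scan idea can be sketched in Python with `heapq.merge` for the N-way merge and `itertools.groupby` for the single pass over consecutive equal pairs. Here the chunks are small in-memory lists standing in for sorted runs on disk, and the function name is made up for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter

PAIR = itemgetter(0, 1)  # sort/group key: (calling party, called party)

def edge_weights(chunks):
    """Yield ((caller, callee), total duration) from unsorted chunks of records."""
    sorted_chunks = [sorted(chunk, key=PAIR) for chunk in chunks]  # sort each chunk alone
    merged = heapq.merge(*sorted_chunks, key=PAIR)                 # lazy N-way merge
    for pair, group in groupby(merged, key=PAIR):                  # equal pairs are adjacent
        yield pair, sum(duration for _, _, duration in group)

chunks = [
    [("a", "b", 10), ("c", "d", 5)],
    [("a", "b", 7), ("a", "c", 1)],
]
weights = list(edge_weights(chunks))
```

Only one group is held in memory at a time during the final pass, which is the "little extra memory" property mentioned above.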

I was asked this in a recent interview

I was asked to stay away from HashMap or any sort of Hashing.
The question went something like this -
Let's say you have PRODUCT IDs of up to 20 decimals, along with Product Descriptions. Without using Maps or any sort of hashing function, what's the best/most efficient way to store/retrieve these product IDs along with their descriptions?
Why is using Maps a bad idea for such a scenario?
What changes would you make to sell your solution to Amazon?
A map is good to use when insert/remove/lookup operations are interleaved. Each of these operations runs in O(log n).
In your example you are only doing search operations. You may assume that database updates (inserting/removing a product) won't happen very often. Therefore the interviewer probably wants you to pick the best data structure for lookup operations.
In this case I can see only some as already proposed in other answers:
Sorted array (doing a binary search)
Hashmap
trie
With a trie, if product ids do not share a common prefix, there is a good chance of finding the product description by looking only at the first character (or the first few characters) of the id. For instance, take this product id list, with 125 products:
"1"
"2"
"3"
...
"123"
"124"
"1234567"
Let's assume you are looking for the product id "1234567" in your trie. Looking only at the first characters, "1" then "2" then "3" then "4", will lead to the right product description. There is no need to read the rest of the product id, as there are no other possibilities.
With n the product id length, your lookup will be O(n). But as the example above shows, retrieving the product description can be even faster. As the product id is limited in size (20 characters), the trie height is limited to 20 levels. That means a lookup never goes beyond a constant number of steps, as your search never goes deeper than the trie height => O(1). Any BST lookup, by contrast, is at best amortized O(log N), N being the number of items in your tree.
A hashmap, on the other hand, could give you slower lookups, as you need to compute an index with a hash function that probably reads the whole product id, plus browse a list in case of collisions with other product ids.
With a binary search on a sorted array, lookup performance depends on the number of items in your database.
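A plain digit trie can be sketched in Python as follows. The nodes here are ordinary dicts used only as child tables (the point being illustrated is the trie shape, not the node storage), the `"$"` end marker is an assumption of this example, and this simple version reads every digit of the id rather than stopping early once the path is unambiguous.

```python
END = "$"  # marks "an id ends here" (assumes ids contain only decimal digits)

def trie_insert(root, product_id, description):
    node = root
    for ch in product_id:
        node = node.setdefault(ch, {})  # create the child node on first use
    node[END] = description

def trie_lookup(root, product_id):
    node = root
    for ch in product_id:  # at most 20 steps for 20-digit ids
        node = node.get(ch)
        if node is None:
            return None  # no id with this prefix exists
    return node.get(END)

root = {}
trie_insert(root, "123", "toaster")
trie_insert(root, "1234567", "kettle")
```

The number of steps depends only on the id length, not on how many products are stored.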
A B-Tree in my opinion. Does that still count as a Map?
Mostly because you can have many items loaded at once in memory. Searching these items in memory is very fast.
Consecutive integer numbers are a perfect fit for a hash map, but it has one problem: it does not support multithreaded access by default. Also, since Amazon was mentioned in your question, I think you need to take concurrency and RAM limitations into account.
What you might do in response to such a question is explain that since
you are disallowed from using any built-in data storage schemes, all you can do is "emulate" one.
So, let's say you have M = 10^20 products with their numbers and descriptions.
You can partition this set into groups of N subsets.
Then you can organize M/N containers, each of which holds a significantly reduced number of elements. Using this idea recursively gives you a way to store the whole set in containers such that access to them has acceptable performance.
To illustrate this idea, consider a smaller example of only 20 elements.
I would like you to imagine a file system with directories "1", "2", "3", "4".
In each directory you store the product descriptions as files in the following way:
folder 1: files 1 to 5
folder 2: files 6 to 10
...
folder 4: files 16 to 20
Then your search would only need two steps to find the file.
First, you search for a correct folder by dividing 20 / 5 (your M/N).
Then, you use the given ID to read the product description stored in a file.
This is just a very rough description, however, the idea is very intuitive.
So, perhaps this is what your interviewer wanted to hear.
As for myself, when I face such questions on interview, even if I fail to get the question correctly (which is the worst case :)) I always try to get the correct answer from the interviewer.
Best/efficient for what? Would have been my answer.
E.g. for storing them, probably the fastest thing to do is to use two arrays with 20 elements each: one for the ids, one for the descriptions. Iterating over those is pretty fast too. And it is efficient memory-wise.
Of course the solution is pretty useless for any real application, but so is the question.
There is an interesting alternative to B-Tree: Radix Tree
I think what he wanted you to do, and I'm not saying it's a good idea, is to use the computer memory space.
If you use a 64-bit (virtual) memory address, and assuming you have all the address space for your data (which is never the case) you can store a one-byte value.
You could use the ProductID as an address, casting it to a pointer, and then get that byte, which might be an offset in another memory for actual data.
I wouldn't do it this way, but perhaps that is the answer they were looking for.
Asaf
I wonder if they wanted you to note that in an ecommerce application (such as Amazon's), a common use case is "reverse lookup": retrieve the product ID using the description. For this, an inverted index is used, where each keyword in a description is an index key, which is associated with a list of relevant product identifiers. Binary trees or skip lists are good ways to index these key words.
Regarding the product identifier index: In practice, B-Trees (which are not binary search trees) would be used for a large, disk-based index of 20-digit identifiers. However, they may have been looking for a toy solution that could be implemented in RAM. Since the "alphabet" of decimal numbers is so small, it lends itself very nicely to a trie.
Hashmaps work really well if the hashing function gives you a very uniform distribution of the hash values of the existing keys. With a really bad hash function, it can happen that the hash values of your 20 values are all the same, which pushes the retrieval time to O(n). Binary search, on the other hand, guarantees you O(log n), but inserting data is more expensive.
All of this is very incremental, the bigger your dataset is the less are the chances of a bad key distribution (if you are using a good, proven hash algorithm), and on smaller data sets the difference between O(n) and O(log n) is not much to worry about.
If the size is limited sometimes it's faster to use a sorted list.
When you use Hash-anything, you first have to calculate a hash, then locate the hash bucket, then use equals on all elements in the bucket. So it all adds up.
On the other hand you could use just a simple ArrayList ( or any other List flavor that is suitable for the application), sort it with java.util.Collections.sort and use java.util.Collections.binarySearch to find an element.
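The answer refers to Java's Collections.sort and Collections.binarySearch; the same sorted-list idea can be sketched in Python with the standard `bisect` module. The function names and the parallel-list layout are choices made for this illustration.

```python
import bisect

def build_index(pairs):
    """Sort (product_id, description) pairs and split them into parallel lists."""
    pairs = sorted(pairs)
    ids = [pid for pid, _ in pairs]
    descriptions = [desc for _, desc in pairs]
    return ids, descriptions

def find(ids, descriptions, product_id):
    i = bisect.bisect_left(ids, product_id)  # binary search, O(log n)
    if i < len(ids) and ids[i] == product_id:
        return descriptions[i]
    return None

ids, descriptions = build_index([(42, "kettle"), (7, "toaster"), (19, "mixer")])
```

Lookups need no hash computation, but every insertion has to keep the lists sorted, which costs O(n) per insert.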
But as Artyom has pointed out maybe a simple linear search would be much faster in this case.
On the other hand, from maintainability point of view, I would normally use HashMap ( or LinkedHashMap ) here, and would only do something special here when profiler would tell me to do it. Also collections of 20 have a tendency to become collections of 20000 over time and all this optimization would be wasted.
There's nothing wrong with hashing or B-trees for this kind of situation - your interviewer probably just wanted you to think a little, instead of coming out with the expected answer. It's a good sign, when interviewers want candidates to think. It shows that the organization values thought, as opposed to merely parroting out something from the lecture notes from CS0210.
Incidentally, I'm assuming that "20 decimal product ids" means "a large collection of product ids, whose format is 20 decimal characters".... because if there's only 20 of them, there's no value in considering the algorithm. If you can't use hashing or Btrees code a linear search and move on. If you like, sort your array, and use a binary search.
But if my assumption is right, then what the interviewer is asking seems to revolve around the time/space tradeoff of hashmaps. It's possible to improve on the time/space curve of hashmaps - hashmaps do have collisions. So you might be able to get some improvement by converting the 20 decimal digits to a number, and using that as an index to a sparsely populated array... a really big array. :)
Selling it to Amazon? Good luck with that. Whatever you come up with would have to be patentable, and nothing in this discussion seems to rise to that level.
20 decimal PRODUCT IDs, along with Product Description
Simple linear search would be very good...
I would create one simple array with ids. And other array with data.
Linear search for a small number of keys (20!) is much more efficient than any binary tree or hash.
I have a feeling based on their answer about product ids and two digits the answer they were looking for is to convert the numeric product ids into a different base system or packed form.
They made a point to indicate the product description was with the product ids to tell you that a higher base system could be used within the current fields datatype.
Your interviewer might be looking for a trie. If you have a [small] constant upper bound on your key, then you have O(1) insert and lookup.
"I think what he wanted you to do, and I'm not saying it's a good idea, is to use the computer memory space.
If you use a 64-bit (virtual) memory address, and assuming you have all the address space for your data (which is never the case) you can store a one-byte value."
Unfortunately 2^64 ≈ 1.8 * 10^19, just slightly below 10^20. Coincidence?
log2(10^20) ≈ 66.44.
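The arithmetic is easy to check with Python's arbitrary-precision integers. The variable names below are only for this illustration.

```python
import math

MAX_IDS = 10**20  # number of distinct 20-digit decimal ids

address_space = 2**64                    # distinct 64-bit addresses
log2_ids = math.log2(MAX_IDS)            # a little over 66.4
bits_needed = (MAX_IDS - 1).bit_length() # whole bits needed for the largest id
```

So a 20-digit id needs 67 bits, three more than a 64-bit address provides.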
Here's a slightly evil proposal.
OK, 2^64 bits can fit inside a memory space.
Assume a bound of N bytes for the description, say N=200. (who wants to download Anna Karenina when they're looking for toasters?)
Commandeer 8*N 64-bit machines with heavy RAM. Amazon can swing this.
Every machine loads in their (very sparse) bitmap one bit of the description text for all descriptions. Let the MMU/virtual memory handle the sparsity.
Broadcast the product tag as a 59-bit number and the bit mask for one byte. (59 = ceil(log2(10^20)) - 8)
Every machine returns one bit from the product description. Lookups are a virtual memory dereference. You can even insert and delete.
Of course paging will start to be a bitch at some point!
Oddly enough, it will work the best if product-id's are as clumpy and ungood a hash as possible.
