I am tasked with finding a solution to the following real-world problem and I am really puzzled about how to solve it.
We have 100 million numbers and 1 billion arrays of numbers (each array can hold up to 1,000 unique numbers).
We pick 1,000 random numbers. We are trying to find the IDs of the arrays containing more than 1 of our 1,000 numbers. If there are more than 10,000 such arrays we need the first 10,000 only.
In a file, for each number we store the IDs of the arrays that number appears in. We can solve the problem by reading all the array IDs for every number and processing them. But those IDs are 8 bytes each, so we need to read 8 bytes * 1 billion = 8GB of data per number if our number appears in every array. In the worst-case scenario we need to read 8GB * 1,000 = 8TB from the HDD. That takes days, not 1 second.
Question: How can I do this in 1 second (or a few seconds) instead of days?
Hint: my problem seems similar to problems that search engines face. I have no experience in that field, but someone who does could be really helpful here.
Related
I have a dataframe df with 9000 unique ids.
like

| id |
|----|
| 1  |
| 2  |
I want to generate a random sample, with replacement, of these 9000 ids 100000 times.
How do I do it in pyspark?
I tried
df.sample(True,0.5,100)
But I do not know how to get exactly 100000 rows.
Okay, so first things first: you will probably not be able to get exactly 100,000 rows in your (over)sample. The reason is that in order to sample efficiently, Spark uses something called Bernoulli sampling. Basically, it goes through your RDD and assigns each row a probability of being included. So if you want a 10% sample, each row individually has a 10% chance of being included; Spark doesn't guarantee that the total adds up exactly to the number you want, but it tends to be very close for large datasets.
The code would look like this: df.sample(True, 11.11111, 100). This will take a sample of the dataset equal to 11.11111 times the size of the original dataset. Since 11.11111*9,000 ~= 100,000, you will get approximately 100,000 rows.
If you want an exact sample, you have to use df.rdd.takeSample(True, 100000) (takeSample is an RDD method, not a DataFrame one). However, the result is not a distributed dataset: the call returns a local list (a very large one) on the driver. If it can be held in main memory then do that; but because you require exactly the right number of IDs, I don't know of a way to do it in a distributed fashion.
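For reference, a minimal PySpark sketch of both routes; the `spark` session and the `df` built here are stand-ins for your real dataframe:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(9000)  # stand-in for your dataframe of 9000 unique ids

# Approximate: ~100,000 rows, sampled with replacement, stays distributed.
approx = df.sample(withReplacement=True, fraction=100000 / 9000, seed=100)

# Exact: returns a local list of 100,000 rows on the driver, not a DataFrame.
exact_rows = df.rdd.takeSample(True, 100000, seed=100)
```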
I have to write an Oracle procedure that registers 16-digit security numbers in registered_security_numbers.
I know that the first 6 digits of a security number are either 1234 11 or 1234 12; the remaining 10 digits are generated randomly.
I have 2 possible solutions:
1. Write a second procedure which generates all possible security numbers and inserts them into a possible_security_numbers table with a flag free=1. Then, when I get a request to register a new security number, I query possible_security_numbers for a random security number which is free and insert it into registered_security_numbers.
2. Every time I get a request to register a security number, I generate a random number in the range 1234 1100 0000 0000 - 1234 1299 9999 9999 until I get a security number which does not exist in the registered_security_numbers table, and insert it into registered_security_numbers.
I don't like approach (1) because the possible_security_numbers table will contain several billion entries, and I am not sure how well or how fast selects/updates will run against it.
I don't like approach (2) because if I have many records in the registered_security_numbers table, generating a random number from the range might have to be repeated many times.
I'd like to know if anyone has another solution, or can comment on my solutions, which both seem bad to me …
How many numbers are you actually going to generate?
Imagine that, at most, you're going to generate 1 million (10^6) numbers. Since the range holds 2 * 10^10 possible numbers, the odds that any given request will collide and need a second random number are then at most 10^6 / (2 * 10^10) = 5 * 10^-5 (0.00005, or 0.005%). If that's the case, it makes little sense to worry about the expense of occasionally generating a second number, or the near impossibility of generating a third. The second approach will be much more efficient.
On the other hand, imagine that you intend to generate 1 billion numbers over time. If that's the case, then by the end the odds that you're going to need to generate a second number are 5%, and you'll need to generate 3 or 4 numbers reasonably often. Here the trade-offs are much harder to figure out. Depending on the business, the performance impact of catching the unique-constraint violation exception and generating multiple numbers on some calls may cause a service to violate its SLA often enough to matter, while enumerating the valid numbers may be more efficient on average.
On the third hand, imagine that you intend to generate all 20 billion numbers over time. If that's the case, by the end you'd expect to have to generate about 20 billion random candidates before you found the one remaining valid number (each draw succeeds with probability 1 in 20 billion). Here the clear advantage lies with the first option of enumerating all possible numbers and tracking which ones have been used.
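To make the trade-off concrete, here is a toy Python simulation of approach (2)'s retry loop; the set stands in for the registered_security_numbers table, where Oracle would instead raise a unique-constraint violation on INSERT:

```python
import random

LOW = 1234_1100_0000_0000
HIGH = 1234_1299_9999_9999  # the range holds 2 * 10^10 possible numbers

registered = set()  # stand-in for the registered_security_numbers table

def register_new_number():
    """Keep drawing random candidates until one is unused (approach 2)."""
    draws = 0
    while True:
        draws += 1
        candidate = random.randint(LOW, HIGH)
        if candidate not in registered:
            registered.add(candidate)
            return candidate, draws

# With a fraction f of the range already used, a draw collides with
# probability f, so the expected number of draws per request is 1/(1 - f).
```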
I am generating about 100 million random numbers to pick from 300 things. I need to set it up so that I have 10 million independent instances (with different seeds) that pick 10 times each. The goal is for the aggregate results to have very low discrepancy, as in, each item gets picked about the same number of times.
The problem is that with a regular PRNG, some numbers get chosen more than others (I tried an LCG and the Mersenne Twister); the difference between the most-picked and least-picked items can be several thousand to ten thousand. With linear congruential generators and the Mersenne Twister I also tried picking 100 million times with 1 instance, and that didn't yield uniform results either. I'm guessing this is because the period is very long, and perhaps 100 million isn't big enough. Theoretically, if I pick enough numbers, the results should reach uniformity (settle at the expected value).
I switched to Sobol, a quasirandom generator, and got much better results with the 100-million-from-1-instance test (the difference between most picked and least picked is about 5). But splitting the picks up into 10 million instances at 10 each, the uniformity was lost and I got results similar to the PRNG's. Sobol seems very sensitive to sequence - skipping ahead randomly diminishes uniformity.
Is there a class of random generators that can maintain quasirandom-like low discrepancy even when combined across 10 million independent instances, or is that theoretically impossible? One solution I can think of now is to use 1 Sobol generator shared across all 10 million instances, so effectively it is the same as the 100-million-from-1-instance test.
Both shuffling and proper use of Sobol should give you the uniformity you want. Shuffling needs to be done at the aggregate level: start with a global 100M sample having the desired aggregate frequencies, then shuffle it to introduce randomness, and finally split it into the 10-pick instances (shuffling within each instance wouldn't help globally, as you noted).
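A sketch of that aggregate-level shuffle in NumPy (the sizes are just the question's parameters; int16 keeps the 100M-element array around 200MB):

```python
import numpy as np

N_ITEMS, N_INSTANCES, PICKS = 300, 10_000_000, 10
total = N_INSTANCES * PICKS  # 100 million picks overall

# Global sample with (near-)exact aggregate frequencies: every item appears
# total // N_ITEMS times, plus a few random extras to cover the remainder.
per_item, remainder = divmod(total, N_ITEMS)
picks = np.repeat(np.arange(N_ITEMS, dtype=np.int16), per_item)
extras = np.random.choice(N_ITEMS, remainder, replace=False).astype(np.int16)
picks = np.concatenate([picks, extras])

np.random.shuffle(picks)                       # randomness at the global level
instances = picks.reshape(N_INSTANCES, PICKS)  # one row per 10-pick instance
```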
But that's an additional level of uniformity you might not really need: plain randomness might be enough.
First of all I would check the check itself, because it sounds strange that with enough samples you're really seeing significant deviations (look up the "chi-square test" to quantify such significance, or equivalently to see how many samples are "enough"). As a first sanity check, if you're picking independent values, simplify to 10M instances picking 10 values out of 2 categories: do you get approximately a binomial distribution? For exclusive picking it's a different distribution (hypergeometric, IIRC, but check). Then generalize to more categories (multinomial distribution), and only afterwards is it safe to proceed with your full problem.
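Note that a spread of a few thousand may be exactly what randomness predicts here: for 100M uniform picks over 300 items, each count has mean ~333,333 and standard deviation sqrt(10^8 * (1/300) * (299/300)) ≈ 577, so the max-min gap over 300 items will typically be a few thousand. A quick sanity check of this kind, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.stats import chisquare

N_ITEMS = 300
rng = np.random.default_rng()

# Accumulate 100M picks in 1M-sized chunks to keep memory modest.
counts = np.zeros(N_ITEMS, dtype=np.int64)
for _ in range(100):
    counts += np.bincount(rng.integers(0, N_ITEMS, 1_000_000),
                          minlength=N_ITEMS)

stat, p = chisquare(counts)  # default null: uniform expected frequencies
print(f"spread = {counts.max() - counts.min()}, chi2 = {stat:.1f}, p = {p:.3f}")
```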
Sort at most 10 million 7-digit numbers. Constraints: 1MB RAM, high speed; several seconds is fine.
[Edit: from comment by questioner: the input values are distinct]
Using a bitmap data structure is a good solution for this problem.
Does that mean I need a bit string whose length is at most 10 million?
Is the RAM enough for that?
I'm confused here.
Thank you
So, there are ~8,000,000 bits in 1MB, but if you have arbitrary 7-digit numbers (up to 9,999,999), a bit vector covering the whole range won't fit. Similarly, it won't work if some numbers can be repeated, because you can only store {0,1} in a bit vector.
But assuming (what I think your problem is asking) that you have a sequence of integers between 0 and 8,000,000 with no duplicates, you can simply allocate a zeroed array of 8,000,000 bits and then, for each number, set the corresponding bit in the array. Outputting the sorted list is then just reading through that array in sequence and printing the index of each 1 bit.
If you are asking the more complex version of the question (0 - 10 million, repeats allowed), then you will need to sort chunks that fit in RAM, store them on disk, and then merge these chunks in linear time, streaming the output (so you never have to hold the whole thing in memory). Here is an implementation of a very similar thing in Python: http://neopythonic.blogspot.com/2008/10/sorting-million-32-bit-integers-in-2mb.html
Start with a bit array representing the lowest 8 million possible values. Read through the input and set a bit for every value within range, then output all the numbers for the turned-on bits in sequence. Next, clear the first 2 million bits of the array so that they can represent the highest 2 million possible values. Read through the input again and set a bit for every value in the new range, then output all the values in this range. Done.
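A Python sketch of that two-pass plan, assuming the input can be re-read from the start on each pass (as it can when it lives in a file):

```python
def two_pass_bitmap_sort(read_input, emit):
    """Sort distinct 7-digit integers in ~1MB using two bitmap passes."""
    for lo, hi in [(0, 8_000_000), (8_000_000, 10_000_000)]:
        bits = bytearray((hi - lo + 7) // 8)  # 1,000,000 bytes, then 250,000
        for n in read_input():                # re-read the input each pass
            if lo <= n < hi:
                off = n - lo
                bits[off >> 3] |= 1 << (off & 7)
        for off in range(hi - lo):            # emit set bits in ascending order
            if bits[off >> 3] & (1 << (off & 7)):
                emit(off + lo)

# Example: two_pass_bitmap_sort(lambda: map(int, open("input.txt")), print)
```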
I am a graduate student of physics and I am working on writing some code to sort several hundred gigabytes of data and return slices of that data when I ask for them. Here is the trick: I know of no good method for sorting and searching data of this kind.
My data essentially consists of a large number of sets of numbers. These sets can contain anywhere from 1 to n numbers each (though in 99.9% of the sets, n is less than 15), and there are approximately 1.5 - 2 billion of these sets (unfortunately this size precludes a brute-force search).
I need to be able to specify a set with k elements and have every set with k+1 elements or more that contains the specified subset returned to me.
Simple Example:
Suppose I have the following sets for my data:
(1,2,3)
(1,2,3,4,5)
(4,5,6,7)
(1,3,8,9)
(5,8,11)
If I were to give the request (1,3) I would get the sets: (1,2,3), (1,2,3,4,5), and (1,3,8,9).
The request (11) would return the set: (5,8,11).
The request (1,2,3) would return the sets: (1,2,3) and (1,2,3,4,5).
The request (50) would return no sets.
By now the pattern should be clear. The major difference between this example and my data is that the sets within my data are larger, the numbers used for each element of the sets run from 0 to 16383 (14 bits), and there are many, many more sets.
If it matters, I am writing this program in C++, though I also know Java, C, some assembly, some Fortran, and some Perl.
Does anyone have any clues as to how to pull this off?
edit:
To answer a couple questions and add a few points:
1.) The data does not change. It was all taken in one long set of runs (each broken into 2 gig files).
2.) As for storage space: the raw data takes up approximately 250 gigabytes. I estimate that after processing, and after stripping off a lot of extraneous metadata that I am not interested in, I could knock that down to anywhere from 36 to 48 gigabytes depending on how much metadata I decide to keep (without indices). Additionally, if in my initial processing of the data I encounter enough sets that are identical, I might be able to compress the data further by adding counters for repeat events rather than simply repeating the events over and over.
3.) Each number within a processed set actually contains at LEAST two values: 14 bits for the data itself (detected energy) and 7 bits of metadata (detector number). So I will need at LEAST three bytes per number.
4.) My "though in 99.9% of the sets, n is less than 15" comment was misleading. In a preliminary glance through some of the chunks of the data I find that I have sets that contain as many as 22 numbers but the median is 5 numbers per set and the average is 6 numbers per set.
5.) While I like the idea of building an index of pointers into files, I am a bit leery, because for requests involving more than one number I am left with the (at least I think) slow task of finding the pointers common to all the lists, i.e., intersecting the pointer lists of the requested numbers.
6.) In terms of resources available to me, I can muster approximately 300 gigs of space after I have the raw data on the system (the remainder of my quota on that system). The system is a dual-processor server with 2 quad-core AMD Opterons and 16 gigabytes of RAM.
7.) Yes, 0 can occur; it is an artifact of the data acquisition system when it does, but it can occur.
Your problem is the same as the one faced by search engines: "I have a bajillion documents. I need the ones which contain this set of words." You just have (very conveniently) integers instead of words, and smallish documents. The solution is an inverted index. Introduction to Information Retrieval by Manning et al. is available free online, is very readable, and goes into a lot of detail about how to do this.
You're going to have to pay a price in disk space, but it can be parallelized, and should be more than fast enough to meet your timing requirements, once the index is constructed.
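A minimal in-memory sketch of such an inverted index; the real thing would keep the posting lists on disk, but the shape of the query is the same:

```python
from collections import defaultdict

def build_index(sets):
    """Map each value to the ascending list of IDs of sets containing it."""
    index = defaultdict(list)
    for set_id, s in enumerate(sets):
        for value in s:
            index[value].append(set_id)
    return index

def query(index, request):
    """IDs of every set containing all values in `request`."""
    postings = sorted((index.get(v, []) for v in request), key=len)
    result = set(postings[0]) if postings else set()
    for plist in postings[1:]:      # intersect, shortest list first
        result.intersection_update(plist)
    return sorted(result)

# The example from the question:
data = [{1, 2, 3}, {1, 2, 3, 4, 5}, {4, 5, 6, 7}, {1, 3, 8, 9}, {5, 8, 11}]
print(query(build_index(data), (1, 3)))  # -> [0, 1, 3]
```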
Assuming a random distribution of 0-16383, with a consistent 15 elements per set and two billion sets, each element would appear in approximately 1.8M sets. Have you considered (and do you have the capacity for) building a 16384 x ~1.8M lookup table (about 30B entries at 4 bytes each, roughly 120GB)? Given such a table, you could look up which sets contain (1) and (17) and (5555), and then intersect those three ~1.8M-element lists.
My guess is as follows.
Assume that each set has a name or ID or address (a 4-byte number will do if there are only 2 billion of them).
Now walk through all the sets once, and create the following output files:
A file which contains the IDs of all the sets which contain '1'
A file which contains the IDs of all the sets which contain '2'
A file which contains the IDs of all the sets which contain '3'
... etc ...
If there are 16 entries per set, then on average each of these 2^14 files will contain the IDs of about 2^21 sets; with each ID being 4 bytes, this would require about 2^37 bytes (128 GB) of storage.
You'll do the above once, before you process requests.
When you receive requests, use these files as follows:
Look at a couple of numbers in the request
Open up a couple of the corresponding index files
Get the list of all sets which appear in both these files (there are only a couple of million IDs in each file, so this shouldn't be difficult)
See which of these few sets satisfy the remainder of the request
My guess is that if you do the above, creating the indexes will be (very) slow and handling requests will be (very) quick.
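A hypothetical in-memory stand-in for that request flow, where `posting` plays the role of the per-value ID files:

```python
def answer_request(posting, sets, request):
    """Intersect the ID lists of the two rarest request values, then verify.

    `posting` maps a value to the list of IDs of sets containing it;
    `sets` maps an ID to the actual set, as read from the master file.
    """
    vals = sorted(request, key=lambda v: len(posting.get(v, ())))
    candidates = set(posting.get(vals[0], ()))
    if len(vals) > 1:
        candidates &= set(posting.get(vals[1], ()))
    req = set(request)
    # Check the few surviving candidates against the full request.
    return [sid for sid in candidates if req <= sets[sid]]
```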
I have recently discovered methods that use space-filling curves to map multi-dimensional data down to a single dimension. One can then index the data based on its 1D index. Range queries can be carried out easily by finding the segments of the curve that intersect the box representing the query and then retrieving those segments.
I believe that this method is far superior to building the insane indexes suggested above because, after looking at it, the index would be as large as the data I wish to store, which is hardly a good thing. A somewhat more detailed explanation of this can be found at:
http://www.ddj.com/184410998
and
http://www.dcs.bbk.ac.uk/~jkl/publications.html
Make 16383 index files, one for each possible search value. For each value in your input set, write the file position of the start of the set into the corresponding index file. It is important that each of the index files contains the same number for the same set. Now each index file will consist of ascending indexes into the master file.
To search, start reading the index files corresponding to each search value. If you read an index that's lower than the index you read from another file, discard it and read another one. When you get the same index from all of the files, that's a match - obtain the set from the master file, and read a new index from each of the index files. Once you reach the end of any of the index files, you're done.
If your values are evenly distributed, each index file will contain 1/16383 of the input sets. If your average search set consists of 6 values, you will be doing a linear pass over 6/16383 of your original input. It's still an O(n) solution, but your n is a bit smaller now.
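A sketch of that merge over ascending index streams; Python iterators stand in for the on-disk index files:

```python
def merge_intersect(streams):
    """Yield every index that appears in all of the ascending iterators."""
    iters = [iter(s) for s in streams]
    try:
        heads = [next(it) for it in iters]
        while True:
            hi = max(heads)
            if min(heads) == hi:       # same index read from all files: a match
                yield hi
                heads = [next(it) for it in iters]
            else:                      # discard lagging indexes and read on
                heads = [h if h >= hi else next(it)
                         for h, it in zip(heads, iters)]
    except StopIteration:              # any file exhausted: we're done
        return

# list(merge_intersect([[0, 1, 3], [0, 1, 3], [1, 3, 7]])) -> [1, 3]
```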
P.S. Is zero an impossible result value, or do you really have 16384 possibilities?
Just playing devil's advocate for an approach that combines brute force with an index lookup:
Create an index with the min, max, and number of elements of each set.
Then apply brute force, excluding sets where max(set) < max(request) or min(set) > min(request).
In the brute force, also exclude sets whose element count is less than that of the set being searched (see the sketch below).
95% of your searches would then really be brute-forcing a much smaller subset. Just a thought.
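A tiny sketch of that pre-filter (all names hypothetical), assuming the sets are held as Python sets:

```python
def build_summaries(sets):
    """One (min, max, count) triple per set; just a few bytes each."""
    return [(min(s), max(s), len(s)) for s in sets]

def search(sets, summaries, request):
    lo, hi, k = min(request), max(request), len(request)
    req = set(request)
    return [i for i, (smin, smax, n) in enumerate(summaries)
            # Keep only sets that span the request and are big enough ...
            if smin <= lo and smax >= hi and n >= k
            # ... and brute-force only those survivors.
            and req <= sets[i]]
```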