I have distributed 50 million ids within a numeric space of size 10^30. The ids are distributed randomly, so no series or inverse function can be found. For example, the minimum and the maximum are:
25083112306903763728975529743
29353757632236106718171971627
Two consecutive ids are separated by a distance on the order of at least 10^19. For example:
28249462572807242052513352500
28249462537043093417625790615
This distribution is resistant to a brute-force attack, since finding the id consecutive to a known one would take at least 10^19 guesses (to get an idea of the timing: at 1000 guesses per second, that is 10^16 seconds...).
Are there other search algorithms for this space that could take less time and make my id distribution less solid?
If your 50 million ids are really randomly distributed in a space of 10^30, you can't do anything better than brute force.
This means you can only iterate over the 10^30 values in a random order, and on average you have to test 10^30 / (5 × 10^7) = 2 × 10^22 of them to find one.
Of course, there exists an algorithm that finds all of them on the first try, but it's extremely unlikely that you would stumble on it without knowing the ids first.
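A quick sanity check of that arithmetic in Python, at the asker's quoted rate of 1000 guesses per second:

```python
space = 10 ** 30           # size of the id space
assigned = 5 * 10 ** 7     # 50 million assigned ids

expected_guesses = space // assigned           # average random guesses to hit any valid id
print(f"{expected_guesses:.0e}")               # 2e+22
print(f"{expected_guesses / 1000:.0e} s")      # at 1000 guesses per second: 2e+19 seconds
```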
Mega Sena is Brazil's most famous lottery. The number set ranges from 1 to 60, and a single bet can contain from 6 to 15 selected numbers (the more numbers selected, the more expensive the bet). Prizes are awarded, proportionally, to those who correctly guess 4, 5 or 6 numbers, the latter being the jackpot.
Sometimes the total number of bets reaches the hundreds of millions. Drawings are televised live, so, as a player, you know almost instantly whether you won or not. However, the lottery takes several hours to announce the number of winners and which cities they are from.
Given the rules above, how would you implement the computational logic that finds the winners? What data structures would you choose to store the data? Which sorting algorithms, if any, would you select? What's the Big O notation of the best solution you can think of?
I would allocate a 64-bit integer to represent which numbers have been chosen.
The first step would be to expand all the 7- to 15-number tickets into tickets having only 6 bits set, which could be done offline before the drawing. I don't know whether a 15-number ticket covers all (15 choose 6) = 5005 6-element tickets or whether another system is used, but since it's done offline, the complexity is delegated elsewhere.
There's even an algorithm (or bithack) called the lexicographically next bit permutation, which can generate all those n choose k bit patterns efficiently if it needs to be done in real time.
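A minimal Python sketch of that bit hack (the function name is mine, and Python's arbitrary-precision integers stand in for the fixed-width words the trick is usually shown with); each generated pattern would index six of a ticket's own 15 numbers:

```python
def bit_patterns(k, n):
    """Yield every integer with exactly k of its low n bits set, in increasing
    order, using the 'lexicographically next bit permutation' hack."""
    v = (1 << k) - 1                    # smallest pattern: the k low bits set
    limit = 1 << n
    while v < limit:
        yield v
        t = (v | (v - 1)) + 1           # clear the lowest run of 1s and carry into the next bit
        v = t | ((((t & -t) // (v & -v)) >> 1) - 1)   # re-seed the dropped 1s at the bottom

# Sanity check: a 15-number ticket expands to C(15, 6) = 5005 six-number sub-tickets.
print(sum(1 for _ in bit_patterns(6, 15)))   # 5005
```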
Mask all those tickets with the bit pattern of the winning row, and count the bits left. This should be extremely efficient, taking on the order of one second for a billion tickets on a modern computer that has the popcount instruction.
The other issue is the validation, integrity and confidentiality of the data when associated with the ticket holders. I would guess that this is the real issue and is probably addressed by implementing the whole thing as a database query.
You can hold the selection of each person within a single 64-bit word with each bit representing a selected number. The entire dataset could fit in memory in one long integer array.
If the array were kept in the same order as your database, e.g., by ticket ID, then you could retrieve the associated record simply by knowing that the position in the array is the same as the rownum of your query.
If you performed a bitwise AND operation of each value with the 64-bit representation of the winning numbers, you could count the bits and store the offsets of any matching 4, 5, or 6 bits into their own respective lists.
The whole operation would be pretty obviously linear O(n).
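A minimal sketch of that idea; the toy `tickets` array, the bit layout and the `winning_mask` here are made up for illustration:

```python
tickets = [0b111111, 0b1011111, 0b11001111]   # toy bets, one bit per chosen number
winning_mask = 0b111111                       # toy winning row: numbers 1..6

matches = {4: [], 5: [], 6: []}
for offset, bet in enumerate(tickets):
    hits = bin(bet & winning_mask).count("1")   # popcount of the bitwise AND
    if hits >= 4:
        matches[hits].append(offset)            # offset doubles as the rownum in the database

print(matches)   # {4: [2], 5: [1], 6: [0]}
```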
Suppose I have an ordered list of weights of length M. I want to divide this list into N ordered, non-empty sublists such that the sums of the weights in the sublists are as close to each other as possible. The length of the list will always be greater than or equal to the number of partitions.
For example:
A reader of epic fantasy wants to read the entire Wheel of Time series in N = 90 days. She wants to read approximately the same number of words each day, but she doesn't want to break a single chapter across two days. Obviously, she doesn't want to read it out of order either. The series has a total of M chapters, and she has a list of the word counts of each.
What algorithm could she use to calculate the optimum reading schedule?
In this example, the weights probably won't vary much, but the algorithm I'm seeking should be general enough to handle weights that vary widely.
As for what I consider optimum: given the choice, having two or three partitions deviate from the average by a small amount is better than having one partition deviate by a lot. In other words, she would rather have several days where she reads a few hundred words more or fewer than the average if it means she can avoid reading a thousand words more or fewer than the average even once. My thinking is to use something like this to compute the score of any given solution:
Let w_1, w_2, ..., w_N be the weights of the partitions (calculated by simply summing the weights of each partition's elements).
Let x be the total weight of the list divided by the number of partitions N, i.e. the target weight of each partition.
Then the score would be the sum, for i from 1 to N, of (x - w_i)^2.
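Concretely, a tiny helper expressing that score for a candidate split might look like this (purely illustrative):

```python
def score(partitions, target):
    """Sum of squared deviations of each partition's total weight from the target x."""
    return sum((target - sum(part)) ** 2 for part in partitions)

# e.g. two ways to split [5, 3, 4, 6] into N = 2 days, target x = 18 / 2 = 9
print(score([[5, 3], [4, 6]], 9))   # (9-8)^2 + (9-10)^2 = 2
print(score([[5], [3, 4, 6]], 9))   # (9-5)^2 + (9-13)^2 = 32
```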
So, I think I know a way to score each solution. The question is, what's the best way to minimize the score, other than brute force?
Any help or pointers in the right direction would be much appreciated!
As hinted by the first entry under "Related" on the right column of this page, you are probably looking for a "minimum raggedness word wrap" algorithm.
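One way she could realize that idea is a straightforward dynamic program over contiguous splits, minimizing the squared-deviation score from the question; the function name and the toy chapter list below are mine:

```python
def schedule(word_counts, days):
    """Split word_counts, in order, into `days` contiguous groups, minimizing the
    sum of squared deviations from the ideal per-day load. O(days * M^2) time,
    which is fine for a book series' worth of chapters."""
    M = len(word_counts)
    prefix = [0]
    for w in word_counts:
        prefix.append(prefix[-1] + w)
    target = prefix[-1] / days

    INF = float("inf")
    best = [[INF] * (M + 1) for _ in range(days + 1)]   # best[d][i]: first i chapters in d days
    cut = [[0] * (M + 1) for _ in range(days + 1)]      # where the last day starts
    best[0][0] = 0.0
    for d in range(1, days + 1):
        for i in range(d, M + 1):          # at least one chapter per day
            for j in range(d - 1, i):      # day d reads chapters j..i-1
                cost = best[d - 1][j] + (prefix[i] - prefix[j] - target) ** 2
                if cost < best[d][i]:
                    best[d][i], cut[d][i] = cost, j

    # Walk the cut table backwards to recover the actual reading schedule.
    groups, i = [], M
    for d in range(days, 0, -1):
        j = cut[d][i]
        groups.append(word_counts[j:i])
        i = j
    return list(reversed(groups))

print(schedule([100, 300, 200, 400, 100, 250], days=3))   # three contiguous groups near 450 words each
```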
Every now and then I read conspiracy theories about lotto-based games being rigged, with a computer browsing through the combinations chosen by the players and determining an unused subset. It got me thinking: how would such an algorithm have to work in order to determine such subsets really efficiently? Finding unused numbers is definitely ruled out, as is finding the least-used ones, because neither necessarily provides a solution. Going deeper, how could an algorithm efficiently choose a subset that was chosen by at most k players? More formally:
We are given the set of numbers 1 to 50. In the draw, 6 numbers are picked.
INPUT: m subsets, each consisting of 6 distinct numbers from 1 to 50, and an integer k (k >= 0), the maximum number of players allowed to have all 6 of their numbers correct.
OUTPUT: the subsets for which no more than k players win the jackpot ('winning' means all the numbers they chose were picked in the draw).
Is there any efficient algorithm that could compute this without using a terabyte HDD to store the occurrence counts of all 50!/(44!*6!) possible combinations in the pessimistic case? Honestly, I can't think of one.
If I wanted to run such a conspiracy, I would first acquire the list of submissions by players. Then I would generate random lottery selections, see how many winners each such selection would produce, and simply choose the selection most attractive to me. There is little point doing anything more sophisticated, because that is probably already powerful enough to be noticed by statisticians.
If you want to corrupt the lottery, it would probably be easier and safer to select a few competitors you favour and have them win. In the book "1984", I think the state simply announced imaginary lottery winners, with the announcement in each area naming somebody from outside that area. One of the ideas in "The Beckoning Lady" by Margery Allingham is a gang that attempts to set up a racecourse so they can rig races and disguise bribes as winnings.
First of all, the total number of combinations (choosing 6 from 50) is not very large. It is about 16 million, which can easily be handled.
For each combination, keep a count of the number of people who played it. When declaring a winner, choose a combination that has no more than k plays.
If the numbers within each subset are sorted, you can treat the subsets as strings: sort them in lexicographical order, and it is easy to count how many players selected each subset (and which subsets were not selected at all). The time is then proportional to the number of players, not to the number of possible combinations in the lottery.
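A rough sketch combining the two answers above (the `plays` list and `k` here are toy placeholders): count how many players chose each sorted combination, then scan the roughly 15.9 million possible draws for one with at most k jackpot winners.

```python
from collections import Counter
from itertools import combinations

# Each submitted bet is a set of 6 numbers from 1..50. Keying on the sorted tuple
# makes identical bets collide, so counting is proportional to the number of players.
plays = [{1, 2, 3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}, {7, 8, 9, 10, 11, 12}]
counts = Counter(tuple(sorted(bet)) for bet in plays)

k = 1
# Scan all C(50, 6) possible draws and keep those producing at most k jackpot winners.
acceptable = (draw for draw in combinations(range(1, 51), 6)
              if counts.get(draw, 0) <= k)
print(next(acceptable))   # (1, 2, 3, 4, 5, 7) for this toy input
```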
Say I have a list of 100 numbers. I want to split them into 5 groups such that the sum within each group is as close as possible to the mean group sum (one fifth of the total).
The simplest solution is to sort the hundred numbers, take the largest one, and keep adding the smallest numbers until the sum goes beyond the average.
Obviously that is not going to give the best results. I guess we could use BFS or DFS, or some other search algorithm like A*, to get the best result.
Does anyone have a simple solution? Pseudocode is good enough. Thanks!
This sounds like a variation on the knapsack problem, and if I'm interpreting you correctly, it may be the multiple knapsack problem. Can't you come up with an easy problem? :)
An efficient approach is a variation of the best-fit bin-packing algorithm. However, we would apply a variation in which we want exactly 5 groups of numbers, rather than trying to use the smallest number of groups.
The algorithm begins by computing the target group sum: the total of all 100 numbers divided by 5. This target is used as the capacity of each of the 5 groups (bins) we are trying to fill. We then find the largest number in our list that does not exceed the remaining capacity of the current group and assign it to that group (we can find it in O(log n) time by keeping the numbers in a self-balancing binary search tree). We keep track of how full the current group is and keep adding the largest number that still fits, until the group has either reached capacity or no remaining number fits. In either case we move on to the next group and repeat the procedure with the numbers left in our list, remembering the sum of each finished group. We continue until all 5 groups have been filled as far as possible. If any numbers are left over, we place them in the groups with the lowest total sums (which we know, because we kept track of the sums as we went).
This is in fact a greedy algorithm with a Θ(n log n) running time, due to the nature of bin packing.
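A simplified sketch in the same greedy spirit; note it uses the classic longest-processing-time heuristic (largest remaining number into the currently lightest group) rather than the exact fill-to-capacity formulation above:

```python
import heapq

def greedy_groups(numbers, k=5):
    """Assign numbers from largest to smallest to whichever group currently has
    the lowest sum, keeping the k group sums as balanced as the heuristic allows.
    Sorting dominates, so the running time is O(n log n)."""
    heap = [(0, i) for i in range(k)]          # (current sum, group index)
    groups = [[] for _ in range(k)]
    for x in sorted(numbers, reverse=True):
        total, i = heapq.heappop(heap)         # lightest group so far
        groups[i].append(x)
        heapq.heappush(heap, (total + x, i))
    return groups

groups = greedy_groups(range(1, 101))
print([sum(g) for g in groups])   # all close to 5050 / 5 = 1010
```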
I have a process that generates values and that I observe. When the process terminates, I want to compute the median of those values.
If I had to compute the mean, I could just store the sum and the number of generated values and thus have O(1) memory requirement. How about the median? Is there a way to save on the obvious O(n) coming from storing all the values?
Edit: Interested in 2 cases: 1) the stream length is known, 2) it's not.
You are going to need to store at least ceil(n/2) points, because any one of the first n/2 points could be the median. It is probably simplest to just store all the points and find the median. If saving ceil(n/2) points is of value, then read the first n/2 points into a sorted structure (a binary tree is probably best); then, as new points arrive, throw out the low or high points and keep track of the number of points thrown out on either end.
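One way to realize that bound when the length n is known in advance, sketched here with a max-heap instead of a sorted list (names are illustrative):

```python
import heapq

def median_known_length(stream, n):
    """Keep only the (n // 2 + 1) smallest values seen so far in a max-heap;
    once the stream ends, the largest of them is the (upper) median, and only
    about ceil(n/2) points were ever stored."""
    keep = n // 2 + 1
    heap = []                          # max-heap simulated with negated values
    for x in stream:
        heapq.heappush(heap, -x)
        if len(heap) > keep:
            heapq.heappop(heap)        # discard the largest of the kept values
    return -heap[0]

print(median_known_length(iter([5, 1, 9, 3, 7]), n=5))   # 5
```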
Edit:
If the stream length is unknown then, as Stephen observed in the comments, we obviously have no choice but to remember everything. If duplicate items are likely, we could save a bit of memory using Dolphin's idea of storing values and counts.
I had the same problem and found a way that has not been posted here. Hopefully my answer can help someone in the future.
If you know your value range and don't care much about the precision of the median, you can incrementally build a histogram of quantized values using constant memory. It is then easy to find the median (or the value at any rank), up to the quantization error.
For example, suppose your data stream consists of image pixel values and you know these values are integers falling within 0-255. To build the histogram incrementally, just create 256 counters (bins) initialized to zero and increment the bin corresponding to each pixel value while scanning through the input. Once the histogram is built, find the first bin whose cumulative count exceeds half the data size to get the median.
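A small sketch of that pixel-value case, using constant memory (256 counters) regardless of stream length:

```python
def histogram_median(pixels):
    """Median of a stream of integer values in 0-255 via an incremental histogram."""
    bins = [0] * 256
    n = 0
    for p in pixels:
        bins[p] += 1
        n += 1
    half = (n + 1) // 2                 # rank of the (lower) median
    cumulative = 0
    for value, count in enumerate(bins):
        cumulative += count
        if cumulative >= half:
            return value

print(histogram_median([3, 200, 7, 7, 42]))   # 7
```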
For real-valued data, you can still compute a histogram with quantized bins (e.g. bins of width 10, 1, or 0.1), depending on your expected value range and the precision you want.
If you don't know the value range of the entire data sample, you can still estimate the possible range of the median and compute the histogram within that range. This naturally drops outliers, which is exactly what we want when computing a median.
You can use statistics, if that's acceptable; for example, you could use sampling. Or you can use knowledge about your number stream: a counting-sort-like approach (k distinct values means O(k) memory), or toss out known outliers and keep a (high, low) counter.
If you know you have no duplicates, you could use a bitmap... but that's just a smaller constant for O(n).
If you have discrete values and lots of repetition you could store the values and counts, which would save a bit of space.
Possibly at stages through the computation you could discard the top 'n' and bottom 'n' values, as long as you are sure that the median is not in that top or bottom range.
e.g. Let's say you are expecting 100,000 values. Every time your stored count reaches (say) 12,000, you could discard the highest 1,000 and the lowest 1,000, dropping storage back to 10,000.
If the distribution of values is fairly consistent, this works well. However, if there is a possibility that you will receive a large number of very high or very low values near the end, that might distort your computation. Basically, if you discard a "high" value that is less than the (eventual) median, or a "low" value that is equal to or greater than the (eventual) median, then your calculation is off.
Update
Bit of an example
Let's say that the data set is the numbers 1,2,3,4,5,6,7,8,9.
By inspection the median is 5.
Let's say that the first 5 numbers you get are 1,3,5,7,9.
To save space we discard the highest and lowest, leaving 3,5,7
Now get two more, 2,6 so our storage is 2,3,5,6,7
Discard the highest and lowest, leaving 3,5,6
Get the last two 4,8 and we have 3,4,5,6,8
Median is still 5 and the world is a good place.
However, let's say that the first five numbers we get are 1,2,3,4,5
Discard top and bottom leaving 2,3,4
Get two more 6,7 and we have 2,3,4,6,7
Discard top and bottom leaving 3,4,6
Get last two 8,9 and we have 3,4,6,8,9
That gives a median of 6, which is incorrect.
If our numbers are well distributed, we can keep trimming the extremities. If they might be bunched in lots of large or lots of small numbers, then discarding is risky.
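A rough sketch of the trimming idea, with the same caveat: it silently gives a wrong answer if the true median ever falls inside a trimmed range. The buffer size and trim amount are the illustrative figures from above.

```python
import bisect

def trimmed_median(stream, cap=12000, trim=1000):
    """Keep a sorted buffer; whenever it reaches `cap`, drop `trim` values from
    each end while remembering how many were dropped low and high. The median is
    then read out of the buffer at an offset that accounts for the dropped values."""
    buf, dropped_low, dropped_high = [], 0, 0
    for x in stream:
        bisect.insort(buf, x)
        if len(buf) >= cap:
            del buf[-trim:]            # discard the current highest values
            del buf[:trim]             # ... and the current lowest values
            dropped_low += trim
            dropped_high += trim
    n = len(buf) + dropped_low + dropped_high
    mid = n // 2                       # index of the (upper) median in the full data
    return buf[mid - dropped_low]      # only valid if the median survived the trimming

print(trimmed_median(range(100)))   # 50 for this well-behaved stream
```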