How to shuffle eight items to approximate maximum entropy?

I need to analyze 8 chemical samples repeatedly over 5 days (each sample is analyzed exactly once every day). I'd like to generate pseudo-random sample sequences for each day which achieve the following:
avoid bias in the daily sequence position (e.g., avoid some samples being processed mostly in the morning)
avoid repeating sample pairs over different days (e.g. 12345678 on day 1 and 87654321 on day 2)
generally randomize the distance between two given samples from one day to the other
I may have poorly phrased the conditions above, but the general idea is to minimize systematic effects like sample cross-contamination and/or analytical drift over each day. I could just shuffle each sequence randomly, but because the number of sequences generated is small (N=5 versus 40,320 possible combinations), I'm unlikely to approach something like maximum entropy.
Any ideas? I suspect this is a common problem in analytical science which has been solved, but I don't know where to look.

Just thinking out loud:
A base metric you could use is the Levenshtein distance, or a slight modification of it, e.g.
myDist(w1, w2) = min(levD(w1, w2), levD(w1.reversed(), w2))
Since you want to avoid small distances between any pair of days,
the overall metric can be the sum of myDist over all pairs of days:
Similarity = myDist(day1, day2)
+ myDist(day1, day3)
+ myDist(day1, day4)
+ myDist(day1, day5)
+ myDist(day2, day3)
+ myDist(day2, day4)
+ myDist(day2, day5)
+ myDist(day3, day4)
+ myDist(day3, day5)
+ myDist(day4, day5)
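For concreteness, here is a minimal Python sketch of this pairwise metric, with the day orderings represented as strings like "12345678" and a plain dynamic-programming edit distance rather than a library call (the helper names are my own):

from itertools import combinations

def levenshtein(a, b):
    # Classic DP edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                      # deletion
                           cur[j - 1] + 1,                   # insertion
                           prev[j - 1] + (ca != cb)))        # substitution
        prev = cur
    return prev[-1]

def my_dist(w1, w2):
    # Treat a reversed ordering as just as "close" as the original.
    return min(levenshtein(w1, w2), levenshtein(w1[::-1], w2))

def similarity(days):
    # Sum of my_dist over every pair of day orderings.
    return sum(my_dist(a, b) for a, b in combinations(days, 2))

print(similarity(["12345678", "87654321", "45827136"]))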
What is still missing is a heuristic for creating the sample orders.
Your problem reminds me of a shortest-path problem, but with the added difficulty that each selected node influences the weights of the whole graph, so it is much harder.
Maybe a table of the myDist distances between every pair of the 8! permutations could be precomputed (the metric is symmetric, so only a triangular matrix without the diagonal is needed, roughly 1 GB of memory). This may speed things up very much.
Maybe take the maximum from this matrix and treat every permutation pair with a value below some threshold as equally worthless, to reduce the search space.
Build a starting set:
Use 12345678 as a fixed day 1, since the first day does not matter. Never change this.
Then repeat until n days are chosen:
add the permutation most distant from the current one;
if there are several equally good candidates, use the one that is also most distant from the previous days.
Now iteratively improve the solution, maybe with some ruin-and-recreate approach. Always keep a backup of the absolute best solution found so far; you can run as many iterations as you want (and have time for):
choose the (one or two) day(s) with the smallest distance sums to the other days;
maybe brute-force an optimal (in terms of overall distance) combination for these two days;
repeat.
If the optimization gets stuck (only the same two days keep being chosen, or the distance is not improving at all),
randomly change one or two days to random orders.
Alternatively, totally random starting sets (apart from day 1) could be selected. A sketch of the greedy construction step follows below.
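A minimal sketch of the greedy starting-set construction, reusing the my_dist helper from the snippet above; sampling a random candidate pool instead of enumerating all 8! permutations is my own shortcut for speed:

import random

def greedy_start(n_days, n_candidates=5000, seed=0):
    rng = random.Random(seed)
    base = "12345678"
    days = [base]                            # day 1 is fixed and never changed
    pool = set()
    while len(pool) < n_candidates:          # random candidate permutations
        p = list(base)
        rng.shuffle(p)
        pool.add("".join(p))
    pool.discard(base)
    for _ in range(n_days - 1):
        # Most distant from the current (last chosen) day; ties broken by the
        # total distance to all previously chosen days.
        best = max(pool, key=lambda c: (my_dist(c, days[-1]),
                                        sum(my_dist(c, d) for d in days)))
        days.append(best)
        pool.discard(best)
    return days

print(greedy_start(5))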


A good randomizer for puzzle-15

I have implemented a 15-puzzle for people to compete online. My current randomizer works by starting from the solved configuration and moving tiles around for 100 moves (an arbitrary number).
Everything is fine; however, once in a while the tiles end up shuffled too easily and the puzzle takes only a few moves to solve, which makes the game unfair, since some players reach better scores at a much higher speed.
What would be a good way to randomize the initial configuration so it is not "too easy"?
You can generate a completely random configuration (that is solvable) and then use some solver to determine the optimal sequence of moves. If the sequence is long enough for you, good, otherwise generate a new configuration and repeat.
Update & details
There is an article on Wikipedia about the 15-puzzle and when it is (and isn't) solvable. In short, if the empty square is in the lower-right corner, then the puzzle is solvable if and only if the permutation of the tiles with respect to the goal is even, i.e. it can be produced by an even number of swaps of two elements (not necessarily adjacent ones), which is the same as saying the number of inversions is even.
You can then easily generate a solvable start state by applying an even number of random swaps, which can reach a hard-to-solve state far quicker than performing regular moves, and it is guaranteed to remain solvable.
In fact, you don't need to run a full search as I mentioned above; an admissible heuristic is enough. Such a heuristic never overestimates the number of moves needed to solve the puzzle, i.e. you are guaranteed that the solution will not take fewer moves than the heuristic tells you.
A good heuristic is the sum of manhattan distances of each number to its goal position.
Summary
In short, a possible (very simple) algorithm for generating starting positions might look like this:
1: current_state <- goal_state
2: swap two arbitrary (randomly selected) pieces
3: swap two arbitrary (randomly selected) pieces again (to ensure solvability)
4: h <- heuristic(current_state)
5: if h > desired threshold
6: return current_state
7: else
8: go to 2.
To be absolutely certain about how difficult a state is, you need to find the optimal solution using some solver. Heuristics will give you only an estimate.
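A minimal Python sketch of this generator, using the Manhattan-distance heuristic mentioned above; the board is a flat tuple with 0 for the blank kept in the lower-right corner, and the threshold value is just a placeholder to tune:

import random

GOAL = tuple(list(range(1, 16)) + [0])       # tiles 1..15, blank (0) last

def manhattan(state):
    # Sum of Manhattan distances of each tile from its goal position.
    total = 0
    for pos, tile in enumerate(state):
        if tile == 0:
            continue
        goal_pos = tile - 1
        total += abs(pos // 4 - goal_pos // 4) + abs(pos % 4 - goal_pos % 4)
    return total

def random_start(threshold=30, rng=random):
    state = list(GOAL)
    while manhattan(state) <= threshold:
        # Two swaps of non-blank tiles keep the permutation even, hence
        # solvable, and the blank stays in the lower-right corner.
        for _ in range(2):
            i, j = rng.sample(range(15), 2)
            state[i], state[j] = state[j], state[i]
    return tuple(state)

print(random_start())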
I would do this:
1. start from the solution (just like you did)
2. make a valid move in a random direction
so you must keep track of where the gap is, generate a random direction (N, E, S, W) and do the move. I think you have done this part too.
3. compute the randomness of your placement
That is, compute some coefficient that depends on the order of the array, so that ordered (solved) configurations get low values and random ones get high values. The formula for the coefficient is a matter of trial and error. Here are some ideas for what to use:
correlation coefficient
sum of the average differences between each value and its neighbours, e.g. for the board
1 2 4
3 6 5
9 8 7
coeff(6)= (|6-3|+|6-5|+|6-2|+|6-8|)/4
coeff=coeff(1)+coeff(2)+...coeff(15)
absolute distance from the ordered array
You can combine several approaches together, and you can compute this separately for rows and columns and then combine the sub-coefficients.
4. loop step #2 until the coefficient from step #3 is high enough (threshold)
The threshold can also be used to change the difficulty. A sketch of the neighbour-difference coefficient follows below.
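A minimal sketch of the neighbour-difference coefficient from step #3 (Python, for a 4x4 board stored as a list of rows with 0 for the gap; skipping the gap is my own choice):

def coeff(board):
    # Average absolute difference between each tile and its orthogonal
    # neighbours, summed over all tiles: low for ordered boards, higher
    # for scrambled ones.
    n = len(board)
    total = 0.0
    for r in range(n):
        for c in range(n):
            v = board[r][c]
            if v == 0:                       # skip the gap
                continue
            diffs = [abs(v - board[r + dr][c + dc])
                     for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                     if 0 <= r + dr < n and 0 <= c + dc < n]
            total += sum(diffs) / len(diffs)
    return total

solved = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 0]]
print(coeff(solved))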

Most optimal match-up

Let's assume you're a baseball manager. You have N pitchers in your bullpen (N<=14) and they have to face M batters (M<=100). You know the strength of each pitcher and each batter. For those not familiar with baseball: once you bring in a relief pitcher, he can pitch to any number of consecutive batters, but once he's taken out of the game he cannot come back.
For each pitcher, the probability that he loses his match-ups is given by (the sum of the strengths of all batters he will face)/(his strength). Try to minimize these probabilities, i.e. try to maximize your chances of winning the game.
For example, we have 3 pitchers and they have to face 3 batters. The batters' strengths are:
10 40 30
While the strength of your pitchers is:
40 30 3
The optimal solution would be to bring in the strongest pitcher to face the first 2 batters and the second-strongest to face the third batter. Then the probability of each pitcher losing his match-ups will be:
50/40 = 1.25 and 30/30 = 1
So the probability of losing the game would be 1.25 (this number can be greater than 1, i.e. more than 100%).
How can you find the optimal value? I was thinking of taking a greedy approach, but I doubt it will always give the optimum. Also, the fact that a pitcher can face an unlimited number of batters (well, limited only by M) is the main difficulty for me.
Probabilities must be in the range [0.0, 1.0] so what you call a probability can't be a probability. I'm just going to call it a score and minimize it.
I'm going to assume for now that you somehow know the order in which the pitchers should play.
Given the order, what is left to decide is how long each pitcher plays. I think you can find this out using dynamic programming. Consider the batters to be faced in order. Build an NxM table best[pitchers, batter] where best[i, j] is the best score you can make considering just the first j batters using the first i pitchers, or HUGE if it does not make sense.
best[1, j] is just the score of the first pitcher (in the assumed order) against the first j batters, and best[i, 1] doesn't make sense for any i > 1.
For larger values of i you work out best[i, j] by considering where the last change of pitcher could be: pitcher i takes over starting at batter t+1, for every possible t from i-1 up to j-1. For each such t, look up best[i-1, t] to get the score up to just before that change, then calculate the a/b value for the last pitcher from the sum of batter strengths between batter t+1 and batter j; the score for that choice of t is the larger of the two (since the final score is the maximum over pitchers). When you have considered all possible values of t, take the best (smallest) score and use it as the value for best[i, j]. Note down enough info (such as the value of t that turned out to be best) so that once you have calculated best[N, M], you can trace back to find the best schedule.
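A minimal Python sketch of this DP under that reading (pitchers are listed in the order they will pitch, each used pitcher faces at least one batter, and, as an extra assumption consistent with the question's example, the final answer may use fewer than N pitchers):

def best_schedule(pitchers, batters):
    # Returns (worst ratio, batters faced by each used pitcher, in order).
    n, m = len(pitchers), len(batters)
    HUGE = float("inf")
    prefix = [0.0]
    for b in batters:
        prefix.append(prefix[-1] + b)
    best = [[HUGE] * (m + 1) for _ in range(n + 1)]
    choice = [[0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            for t in range(i - 1, j):        # pitcher i faces batters t+1 .. j
                ratio = (prefix[j] - prefix[t]) / pitchers[i - 1]
                score = max(best[i - 1][t], ratio)
                if score < best[i][j]:
                    best[i][j] = score
                    choice[i][j] = t
    used = min(range(1, n + 1), key=lambda k: best[k][m])
    counts, j = [], m
    for k in range(used, 0, -1):             # trace back the group sizes
        t = choice[k][j]
        counts.append(j - t)
        j = t
    return best[used][m], list(reversed(counts))

print(best_schedule([40, 30, 3], [10, 40, 30]))    # the question's example: (1.25, [2, 1])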
You don't actually know the order, and because the final score is the maximum of the a/b values over the pitchers, the order does matter. However, given a division of the batters into groups, the best way to assign pitchers to groups is to assign the strongest pitcher to the group with the highest total batter strength, the next strongest pitcher to the group with the next highest total, and so on. So you could alternate between dividing batters into groups, as described above, and then assigning pitchers to groups to work out the order the pitchers really should be in - keep doing this until the answer stops changing and hope the result is a global optimum. Unfortunately there is no guarantee of this.
I'm not convinced that your score is a good model for baseball, especially since it started out as a probability but can't be. Perhaps you should work out a few examples (maybe even solving small examples by brute force) and see if the results look reasonable.
Another way to approach this problem is via http://en.wikipedia.org/wiki/Branch_and_bound.
With branch and bound you need some way to describe partial answers, and you need some way to work out a value V for a given partial answer, such that no way of extending that partial answer can possibly produce a better answer than V. Then you run a tree search, extending partial answers in every possible way, but discarding partial answers which can't possibly be any better than the best answer found so far. It is good if you can start off with at least a guess at the best answer, because then you can discard poor partial answers from the start. My other answer might provide a way of getting this.
Here a partial answer is a selection of pitchers, in the order they should play, together with the number of batters they should pitch to. The first partial answer would have 0 pitchers, and you could extend this by choosing each possible pitcher, pitching to each possible number of batters, giving a list of partial answers each mentioning just one pitcher, most of which you could hopefully discard.
Given a partial answer, you can compute the (total batter strength)/(Pitcher strength) for each pitcher in its selection. The maximum found here is one possible way of working out V. There is another calculation you can do. Sum up the total strengths of all the batters left and divide by the total strengths of all the pitchers left. This would be the best possible result you could get for the pitchers left, because it is the result you get if you somehow manage to allocate pitchers to batters as evenly as possible. If this value is greater than the V you have calculated so far, use this instead of V to get a less optimistic (but more accurate) measure of how good any descendant of that partial answer could possibly be.
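A minimal sketch of that bound computation (Python; partial is a hypothetical list of (pitcher_strength, batters_faced) pairs and the remaining lists hold the strengths not yet assigned):

def bound(partial, remaining_pitchers, remaining_batters):
    # Worst ratio the partial answer is already committed to.
    committed = max((sum(faced) / strength for strength, faced in partial),
                    default=0.0)
    if remaining_batters and not remaining_pitchers:
        return float("inf")                  # cannot be completed at all
    # Best conceivable ratio for the rest: spread the remaining batters
    # perfectly evenly over all remaining pitchers.
    ideal_rest = (sum(remaining_batters) / sum(remaining_pitchers)
                  if remaining_batters else 0.0)
    return max(committed, ideal_rest)        # no completion can beat this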

How can I measure trends in certain words, like Twitter?

I have a corpus of newspaper articles, organized by day. Each word in the corpus has a frequency count for each day. I have been toying with finding an algorithm that captures break-away words, similar to the way Twitter measures trends in people's tweets.
For Instance, say the word 'recession' appears with the following frequency in the same group of newspapers:
Day 1 | recession | 456
Day 2 | recession | 2134
Day 3 | recession | 3678
While 'europe'
Day 1 | europe | 67895
Day 2 | europe | 71999
Day 3 | europe | 73321
I was thinking of taking the % growth per day and multiplying it by the log of the sum of frequencies. Then I would take the average to score and compare various words.
In this case:
recession = (3.68*8.74+0.72*8.74)/2 = 19.23
europe = (0.06*12.27+0.02*12.27)/2 = 0.49
Is there a better way to capture the explosive growth? I'm trying to mine the daily corpus to find terms that are mentioned more and more within a specific time period. PLEASE let me know if there is a better algorithm. I want to be able to find words with high non-constant acceleration. Maybe taking the second derivative would be more effective. Or maybe I'm making this way too complex and watched too much physics programming on the Discovery Channel. Let me know, with a math example if possible. Thanks!
First thing to notice is that this can be approximated by a local problem. That is to say, a "trending" word really depends only upon recent data. So immediately we can truncate our data to the most recent N days where N is some experimentally determined optimal value. This significantly cuts down on the amount of data we have to look at.
In fact, the NPR article suggests this.
Then you need to somehow look at growth. And this is precisely what the derivative captures. First thing to do is normalize the data. Divide all your data points by the value of the first data point. This makes it so that the large growth of an infrequent word isn't drowned out by the relatively small growth of a popular word.
For the first derivative, do something like this:
d[i] = (data[i] - data[i+k])/k
for some experimentally determined value of k (which, in this case, is a number of days). Similarly, the second derivative can be expressed as:
d2[i] = (data[i] - 2*data[i+k] + data[i+2k])/(2k)
Higher derivatives can also be expressed like this. Then you need to assign some kind of weighting system for these derivatives. This is a purely experimental procedure which really depends on what you want to consider "trending." For example, you might want to give acceleration of growth half as much weight as the velocity. Another thing to note is that you should try your best to remove noise from your data because derivatives are very sensitive to noise. You do this by carefully choosing your value for k as well as discarding words with very low frequencies altogether.
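A minimal sketch of the normalize-then-differentiate scoring (Python; the window k and the weights are the experimental knobs mentioned above, the values here are placeholders, and normalizing by the oldest count is one possible reading of "the first data point"):

def trend_score(counts, k=1, w1=1.0, w2=0.5):
    # counts: daily counts, most recent first, so counts[i + k] is k days older.
    data = [c / counts[-1] for c in counts]                 # normalize
    d1 = (data[0] - data[k]) / k                            # velocity
    d2 = (data[0] - 2 * data[k] + data[2 * k]) / (2 * k)    # acceleration, as above
    return w1 * d1 + w2 * d2

print(trend_score([3678, 2134, 456]))        # recession, from the question
print(trend_score([73321, 71999, 67895]))    # europe, from the question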
I also notice that you multiply by the log of the sum of the frequencies. I presume this is to give the growth of popular words more weight (because more popular words are less likely to trend in the first place). The standard way of measuring how popular a word is is to look at its inverse document frequency (IDF).
I would divide by the IDF of a word to give the growth of more popular words more weight.
IDF[word] = log(D / df[word])
where D is the total number of documents (e.g. for Twitter it would be the total number of tweets) and df[word] is the number of documents containing word (e.g. the number of tweets containing a word).
A high IDF corresponds to an unpopular word whereas a low IDF corresponds to a popular word.
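A minimal sketch of this weighting (Python; doc_freq and total_docs are hypothetical inputs you would compute from your own corpus):

import math

def idf(word, doc_freq, total_docs):
    # Rare words get a high IDF, common words a low one.
    return math.log(total_docs / doc_freq[word])

def weighted_growth(word, growth, doc_freq, total_docs):
    # Dividing by IDF boosts the growth of popular (low-IDF) words.
    return growth / idf(word, doc_freq, total_docs)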
The problem with your approach (measuring daily growth as a percentage) is that it disregards the usual "background level" of the word, as your example shows: 'europe' grows faster than 'recession' in absolute terms, yet it has a much lower score.
If the background level of words has a well-behaved distribution (Gaussian, or something else that doesn't wander too far from the mean) then I think a modification of CanSpice's suggestion would be a good idea. Work out the mean and standard deviation for each word, using days C-N+1-T to C-T, where C is the current date, N is the number of days to take into account, and T is the number of days that define a trend.
Say for instance N=90 and T=3, so we use about three months for the background, and say a trend is defined by three peaks in a row. In that case, for example, you can rank the words according to their chi-squared p-value, calculated like so:
(mu, sigma) = fitGaussian(word='europe', startday=C-N+1-3, endday=C-3)
X1 = count(word='europe', day=C-2)
X2 = count(word='europe', day=C-1)
X3 = count(word='europe', day=C)
S = ((X1-mu)/sigma)^2 + ((X2-mu)/sigma)^2 + ((X3-mu)/sigma)^2
p = pval.chisq(S, df=3)
Essentially then, you can get the words which over the last three days are the most extreme compared to their background level.
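A minimal runnable sketch of this ranking (Python with numpy/scipy; daily_counts maps each word to its counts ordered oldest to newest, with at least N+T entries, and the function names are my own):

import numpy as np
from scipy.stats import chi2

def trend_pvalue(counts, N=90, T=3):
    background = np.array(counts[-(N + T):-T], dtype=float)
    recent = np.array(counts[-T:], dtype=float)
    mu, sigma = background.mean(), background.std(ddof=1)
    if sigma == 0:
        return 1.0                           # flat background, nothing to detect
    S = float(np.sum(((recent - mu) / sigma) ** 2))
    return chi2.sf(S, df=T)                  # small p-value = extreme vs. background

def rank_words(daily_counts, N=90, T=3):
    # Most "trending" (smallest p-value) first.
    return sorted(daily_counts, key=lambda w: trend_pvalue(daily_counts[w], N, T))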
I would first try a simple solution. A simple weighted difference between adjacent days should probably work. Maybe take the log before that. You might have to experiment with the weights. For example, (-2, -1, 1, 2) would give you points where the data is exploding.
If this is not enough, you can try slope filtering ( http://www.claysturner.com/dsp/fir_regression.pdf ). Since the algorithm is based on linear regression, it should be possible to modify it for other types of regression (for example quadratic).
All attempts using filtering techniques such as these also have the advantage, that they can be made to run very fast and you should be able to find libraries that provide fast filtering.

Minimum-Waste Print Job Grouping Algorithm?

I work at a publishing house and I am setting up one of our presses for "ganging", in other words, printing multiple jobs simultaneously. Given that different print jobs can have different quantities, and anywhere from 1 to 20 jobs might need to be considered at a time, the problem would be to determine which jobs to group together to minimize waste (waste coming from over-printing on smaller-quantity jobs in a given set, that is).
Given the following stable data:
All jobs are equal in terms of spatial size--placement on paper doesn't come into consideration.
There are three "lanes", meaning that three jobs can be printed simultaneously.
Ideally, each lane has one job. Part of the problem is minimizing how many lanes each job is run on.
If necessary, one job could be run on two lanes, with a second job on the third lane.
The "grouping" waste from a given set of jobs (let's say the quantities of them are x, y and z) would be the highest number minus the two lower numbers. So if x is the higher number, the grouping waste would be (x - y) + (x - z). Otherwise stated, waste is produced by printing job Y and Z (in excess of their quantities) up to the quantity of X. The grouping waste would be a qualifier for the given set, meaning it could not exceed a certain quantity or the job would simply be printed alone.
So the question is stated: how to determine which sets of jobs are grouped together, out of any given number of jobs, based on the qualifiers of 1) Three similar quantities OR 2) Two quantities where one is approximately double the other, AND with the aim of minimal total grouping waste across the various sets.
(Edit) Quantity Information:
Typical job quantities can be from 150 to 350 on foreign languages, or 500 to 1000 on English print runs. This data can be used to set up some scenarios for an algorithm. For example, let's say you had 5 jobs:
1000, 500, 500, 450, 250
By looking at it, I can see a couple of answers. Obviously (1000/500/500) is not efficient as you'll have a grouping waste of 1000. (500/500/450) is better as you'll have a waste of 50, but then you run (1000) and (250) alone. But you could also run (1000/500) with 1000 on two lanes, (500/250) with 500 on two lanes and then (450) alone.
In terms of trade-offs for lane minimization vs. wastage, we could say that any grouping waste over 200 is excessive.
(End Edit)
...Needless to say, quite a problem. (For me.)
I am a moderately skilled programmer but I do not have much familiarity with algorithms and I am not fully studied in the mathematics of the area. I'm in the process of writing a sort of brute-force program that simply tries all options, pruning any branch of the option tree that produces excessive grouping waste. However, I can't help but hope there's an easier and more efficient method.
I've looked at various websites trying to find out more about algorithms in general and have been slogging my way through the symbology, but it's slow going. Unfortunately, Wikipedia's articles on the subject are very cross-dependent and it's difficult to find an "in". The only thing I've been able to really find would seem to be a definition of the rough type of algorithm I need: "Exclusive Distance Clustering", one-dimensionally speaking.
I did look at what seems to be the popularly referred-to algorithm on this site, the Bin Packing one, but I was unable to see exactly how it would work with my problem.
This seems similar to the classic Operations Research 'cutting stock' problem. For the formal mathematical treatment try
http://en.wikipedia.org/wiki/Cutting_stock_problem
I've coded solutions for the cutting stock problems using delayed column generation technique from the paper "Selection and Design of Heuristic Procedures for Solving Roll Trim Problems" by Robert W. Haessler (Management Sci. Dec '88). I tested it up to a hundred rolls without problem. Understanding how to get the residuals from the first iteration, and using them to craft the new equation for the next iteration is quite interesting. See if you can get hold of this paper, as the author discusses variations closer to your problem.
If you get to a technique that's workable, I recommend using a capable linear algebra solver rather than re-inventing the wheel. While the simplex method is easy enough to code yourself for fractional solutions, what you are dealing with here is harder - it's a mixed integer problem. For a modern C mixed integer programming (MIP) solver using e.g. branch & bound, with Java/Python bindings, I recommend lp_solve.
When I wrote this I found this NEOS guide page useful. The online solver looks defunct though (for me it returns Perl code rather than executing it). There's still some background information.
Edit - a few notes. I'll summarise the differences between your problem and the cutting stock problem:
1) cutting stock has input lengths that are indivisible. You can simulate your divisible jobs by running the problem multiple times, breaking the jobs up into 1.0 and {0.5, 0.5} times the original lengths.
2) your 'length of print run' maps to the section length
3) choose a large stock length
I'm going to try and attack the "ideal" case, in which no jobs are split between lanes or printed alone.
Let n be the number of jobs, rounded up to the nearest multiple of 3. Dummy zero-length jobs can be created to make the number of jobs a multiple of 3.
If n=3, this is trivial, because there's only one possible solution. So assume n>3.
The job (or one of the jobs if there are several) with the highest quantity must inevitably be the highest or joint-highest of the longest job group (or one of the joint longest job groups if there is a tie). Equal-quantity jobs are interchangeable, so just pick one and call that the highest if there is a tie.
So if n=6, you have two job groups, of which the longest-or-equal one has a fixed highest or joint-highest quantity job. So the only question is how to arrange the other 5 jobs between the groups. The formula for calculating the grouping waste can be expressed as 2∑hi - ∑xj, where the hi values are the highest quantities in each group and the xj values are the other quantities. So moving from one possible solution to another is going to involve swapping one of the h's with one of the x's. (If you swapped one of the h's with another one of the h's, or one of the x's with another one of the x's, it wouldn't make any difference, so you wouldn't have moved to a different solution.) Since h2 is fixed and x1 and x2 are useless for us, what we are actually trying to minimise is w(h1, x3, x4) = 2h1 - (x3 + x4). If h1 <= x3 <= x4, this is an optimal grouping because no swap can improve the situation. (To see this, let d = x3 - h1 and note that w(x3, h1, x4) - w(h1, x3, x4) = 3d, which is non-negative; by symmetry the same argument holds for swapping with x4.) So that deals with the case n=6.
For n=9, we have 8 jobs that can be moved around, but again, it's useless to move the shortest two. So the formula this time is w(h1, h2, x3, x4, x5, x6) = 2h1 + 2h2 - (x3 + x4 + x5 + x6), but this time we have the constraint that h2 must not be less than the second-smallest x in the formula (otherwise it couldn't be the highest or joint-highest of any group). As noted before, h1 and h2 can't be swapped with each other, so either you swap one of them with an appropriate x (without violating the constraint), or you swap both of them, each with a distinct x. Take h1 <= x3 <= x4 <= h2 <= x5 <= x6. Again, single swaps can't help, and a double swap can't help either because its effect must necessarily be the sum of the effects of two single swaps. So again, this is an optimal solution.
It looks like this argument is going to work for any n. In which case, finding an optimal solution when you've got an "ideal case" (as defined at the top of my answer) is going to be simple: sort the jobs by quantity and then chop the sorted list into consecutive groups of 3. If this solution proves not to be suitable, you know you haven't got an ideal case.
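A minimal sketch of that ideal-case construction (Python; padding with zero-length dummy jobs as described above, and not counting an unused dummy lane as waste is my own assumption):

def group_waste(group):
    # Over-printing on the real jobs up to the largest quantity in the group.
    top = max(group)
    return sum(top - q for q in group if q > 0)

def ideal_grouping(quantities):
    # Sort descending, pad to a multiple of three, chop into consecutive triples.
    jobs = sorted(quantities, reverse=True)
    jobs += [0] * (-len(jobs) % 3)
    groups = [jobs[i:i + 3] for i in range(0, len(jobs), 3)]
    return groups, sum(group_waste(g) for g in groups)

# The question's 5-job example is not an ideal case, so this comes out wasteful:
print(ideal_grouping([1000, 500, 500, 450, 250]))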
I will have a think about the non-ideal cases, and update this answer if I come up with anything.
If I understand the problem (and I am not sure I do), the solution could be as simple as printing job 1 in all three lanes, then job 2 in all three lanes, then job 3 in all three lanes.
It has a worst case of printing two extra sheets per job.
I can think of cases where this isn't optimal (e.g. three jobs of four sheets each would take six pages rather than four), but it is likely to be far, far simpler to develop than a bin packing solution (which is NP-complete; each of the three lanes, over time, represents a bin).

An algorithm to sort a list of values into n groups so that the sum of each group is as close as possible

Basically, I have a number of values that I need to split into n different groups so that the sums of the groups are as close to one another as possible. The list of values isn't terribly long, so I could potentially just brute-force it, but I was wondering if anyone knows of a more efficient method of doing this. Thanks.
If an approximate solution is enough, then sort the numbers in descending order, loop over them, and assign each number to the group that currently has the smallest sum.
groups = [list() for i in range(NUM_GROUPS)]        # NUM_GROUPS empty groups
for x in sorted(numbers, reverse=True):             # largest numbers first
    # Find the group with the currently smallest sum...
    mingroup = groups[0]
    for g in groups:
        if sum(g) < sum(mingroup):
            mingroup = g
    # ...and put the number there.
    mingroup.append(x)
This problem is called the "multiway partition problem" and is indeed computationally hard. Googling for it turned up an interesting paper, "Multi-Way Number Partitioning", where the author mentions the heuristic suggested by larsmans and proposes some more advanced algorithms. If the above heuristic is not enough, you may have a look at the paper or maybe contact the author; he seems to be doing research in that area.
Brute force might not work out as well as you think...
Presume you have 100 variables and 20 groups:
You can put 1 variable in 20 different groups, which makes 20 combinations.
You can put 2 variables in 20 different groups each, which makes 20 * 20 = 20^2 = 400 combinations.
You can put 3 variables in 20 different groups each, which makes 20 * 20 * 20 = 20^3 = 8000 combinations.
...
You can put 100 variables in 20 different groups each, which makes 20^100 combinations, which is more than the estimated number of atoms in the observable universe (around 10^80).
OK, you can do that a bit smarter (it doesn't matter where you put the first variable, ...) to get to something like Branch and Bound, but that will still scale horribly.
So either use a fast deterministic algorithm, like the one larsmans proposes.
Or if you need a more optimal solution and have the time to implement it, take a look at metaheuristic algorithms and software that implement them (such as Drools Planner).
You can sum the numbers and divide by the number of groups. This gives you the target value for the sums. Sort the numbers and then try to get subsets to add up to the required sum. Start with the largest values possible, as they will cause the most variability in the sums. Once you decide on a group that is not the optimal sum (but close), you could recompute the expected sum of the remaining numbers (over n-1 groups) to minimize the RMS deviation from optimal for the remaining groups (if that's a metric you care about). Combining this "expected sum" concept with larsmans answer, you should have enough information to arrive at a fast approximate answer. Nothing optimal about it, but far better than random and with a nicely bounded run time.
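A minimal sketch of that target-sum idea (Python; the greedy fill toward a recomputed per-group target is only one possible reading of the paragraph above):

def target_sum_groups(numbers, n_groups):
    remaining = sorted(numbers, reverse=True)
    groups = []
    for k in range(n_groups, 0, -1):
        target = sum(remaining) / k          # expected sum per remaining group
        group, total = [], 0
        for x in list(remaining):
            if not group or total + x <= target:
                group.append(x)
                total += x
                remaining.remove(x)
        groups.append(group)
    return groups

print(target_sum_groups([7, 5, 4, 4, 3, 2, 1], 3))   # sums 8, 9, 9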
Do you know how many groups you need to split it into ahead of time?
Do you have some limit to the maximum size of a group?
A few algorithms for variations of this problem:
Knuth's word-wrap algorithm
algorithms minimizing the number of floppies needed to store a set of files, but keeping any one file immediately readable from the disk it is stored on (rather than piecing it together from fragments stored on 2 disks) -- I hear that "copy to floppy with best fit" was popular.
Calculating a cutting list with the least amount of off cut waste.
What is a good algorithm for compacting records in a blocked file?
Given N processors, how do I schedule a bunch of subtasks such that the entire job is complete in minimum time? (multiprocessor scheduling)
