How to sort for most negative values and most positive values across columns?

I am trying to create a new column in my dataframe based on the maximum values across 3 columns. However, depending on the values within each row, I want it to sort for either the most negative value or the most positive value. If the average for an individual row across the 3 columns is greater than 0, I want it to report the most positive value. If it is less than 0, I want it to report back the most negative value.
Here is an example of the dataframe
A B C
-0.30 -0.45 -0.25
0.25 0.43 0.21
-0.10 0.10 0.25
-0.30 -0.10 0.05
And here is the desired output
A B C D
-0.30 -0.45 -0.25 -0.45
0.25 0.43 0.21 0.43
-0.10 0.10 0.25 0.25
-0.30 -0.10 0.05 -0.30
I had first tried playing around with something like
data %>%
  mutate(D = pmax(abs(A), abs(B), abs(C)))
But that just returns a column with the greatest absolute value in each row, where everything comes out positive.
Thanks in advance for your help, and apologies if the formatting of the question is off, I don't use this site a lot. Happy to clarify anything as well.
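For reference, the desired rule can be sketched in Python/pandas (illustration only; the question itself is about R/dplyr, and the DataFrame below just reproduces the example data):

import pandas as pd

df = pd.DataFrame({"A": [-0.30, 0.25, -0.10, -0.30],
                   "B": [-0.45, 0.43, 0.10, -0.10],
                   "C": [-0.25, 0.21, 0.25, 0.05]})

row_mean = df[["A", "B", "C"]].mean(axis=1)
# Row mean > 0: take the row maximum; otherwise take the row minimum.
df["D"] = df[["A", "B", "C"]].max(axis=1).where(
    row_mean > 0, df[["A", "B", "C"]].min(axis=1))
print(df)  # D: -0.45, 0.43, 0.25, -0.30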

Related

Algorithm for optimal expected amount in a profit/loss game

I came upon the following question recently,
"You have a box which has G green and B blue coins. Pick a random coin, G gives a profit of +1 and blue a loss of -1. If you play optimally what is the expected profit."
I was thinking of using a brute-force algorithm where I consider all possible combinations of green and blue coins, but I'm sure there must be a better solution (the range of B and G was from 0 to 5000). Also, what does playing optimally mean? Does it mean that if I pick all blue coins then I would continue playing till all green coins are also picked? If so, does this mean I shouldn't consider all possibilities of green and blue coins?
The "obvious" answer is to play whenever there's more green coins than blue coins. In fact, this is wrong. For example, if there's 999 green coins and 1000 blue coins, here's a strategy that takes an expected profit:
Take 2 coins.
If GG -- stop with a profit of 2.
If BG or GB -- stop with a profit of 0.
If BB -- take all the remaining coins for a total profit of -1.
Since the first and last possibilities each occur with probability near 25%, your overall expectation is approximately 0.25*2 - 0.25*1 = 0.25.
This is just a simple strategy in one extreme example that shows that the problem is not as simple as it first seems.
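A quick exact check of that estimate (a sketch using Python's fractions module; the variable names are illustrative, not from the original answer):

from fractions import Fraction

g, b = 999, 1000
p_gg = Fraction(g, g + b) * Fraction(g - 1, g + b - 1)  # both draws green
p_bb = Fraction(b, g + b) * Fraction(b - 1, g + b - 1)  # both draws blue
ev = p_gg * 2 + p_bb * (-1)  # BG/GB outcomes contribute 0
print(float(ev))  # ~0.249, consistent with the 0.25 estimate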
In general, the expectation with g green coins and b blue coins is given by a recurrence relation:
E(g, 0) = g
E(0, b) = 0
E(g, b) = max(0, g(E(g-1, b) + 1)/(b+g) + b(E(g, b-1) - 1)/(b+g))
The max in the final row occurs because if it's -EV to play, you're better off stopping.
These recurrence relations can be solved using dynamic programming in O(gb) time.
from fractions import Fraction as F

def gb(G, B):
    # E[g][b] is the expected profit with g green and b blue coins left,
    # playing optimally.
    E = [[F(0, 1)] * (B + 1) for _ in range(G + 1)]
    for g in range(G + 1):
        E[g][0] = F(g, 1)  # no blue coins left: take every remaining green coin
    for b in range(1, B + 1):
        for g in range(1, G + 1):
            # Either stop (value 0) or draw: green with probability g/(g+b),
            # blue with probability b/(g+b).
            E[g][b] = max(0, (g * (E[g-1][b] + 1) + b * (E[g][b-1] - 1)) * F(1, b + g))
    for row in E:
        print(' '.join('%5.2f' % v for v in row))
    print()
    return E[G][B]

print(gb(8, 10))
Output:
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1.00 0.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2.00 1.33 0.67 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3.00 2.25 1.50 0.85 0.34 0.00 0.00 0.00 0.00 0.00 0.00
4.00 3.20 2.40 1.66 1.00 0.44 0.07 0.00 0.00 0.00 0.00
5.00 4.17 3.33 2.54 1.79 1.12 0.55 0.15 0.00 0.00 0.00
6.00 5.14 4.29 3.45 2.66 1.91 1.23 0.66 0.23 0.00 0.00
7.00 6.12 5.25 4.39 3.56 2.76 2.01 1.34 0.75 0.30 0.00
8.00 7.11 6.22 5.35 4.49 3.66 2.86 2.11 1.43 0.84 0.36
7793/21879
From this you can see that playing with 8 green and 10 blue coins has positive expectation (EV = 7793/21879 ~= 0.36), and you even have positive expectation with 2 green and 3 blue coins (EV = 0.2).
Simple and intuitive answer:
You should start off with an estimate of the total number of blue and green coins. After each pick you update this estimate. If at any point you estimate there are more blue coins than green coins, you should stop.
Example:
You start and you pick a coin. It's green, so you estimate 100% of the coins are green. You pick a blue, so you estimate 50% of the coins are green. You pick another blue coin, so you estimate 33% of the coins are green. At this point it isn't worth playing anymore, according to your estimate, so you stop.
This answer is wrong; see Paul Hankin's answer for counterexamples and a proper analysis. I leave this answer here as a learning example for all of us.
Assuming that your choice is only when to stop picking coins, you continue as long as G > B. That part is simple. If you start with G < B, then you never start drawing, and your gain is 0. For G = B, no strategy will get you a mathematical advantage; the gain there is also 0.
For the expected reward, take this in two steps:
(1) Expected value on any draw sequence. Do this recursively, figuring the chance of getting green or blue on the first draw, and then the expected values for the new state (G-1, B) or (G, B-1). You will quickly see that the expected value of any given draw number (such as all possibilities for the 3rd draw) is the same as the original.
Therefore, your expected value on any draw is e = (G-B) / (G+B). Your overall expected value is e * d, where d is the number of draws you choose.
(2) What is the expected number of draws? How many times do you expect to draw before G = B? I'll leave this as an exercise for the student, but note the previous idea of doing this recursively. You might find it easier to describe the state of the game as (extra, total), where extra = G-B and total = G+B.
Illustrative exercise: given G=4, B=2, what is the chance that you'll draw GG on the first two draws (and then stop the game)? What is the gain from that? How does that compare with the (4-2)/(4+2) advantage on each draw?

Chance a player has a card given set of possible cards per player

In a trick-taking game, it is often easy to keep track of which cards each player can possibly have left. For instance, if following suit is mandatory and a player does not follow suit, it is obvious that the player has no more cards of that suit.
This means, during the game you can build up knowledge about which cards each player can possibly have.
Is there a way to efficiently calculate (a reasonably accurate) chance that a specific player actually has a certain card?
A naive way would be to generate all permutations of the remaining cards and check which of these permutations are possible given the constraints mentioned earlier. But this is not really efficient.
Another approach would be to just count how many players could have a particular card. For instance, if 3 players might have a particular card, you could use 1/3 as the chance that a particular player has it. But this is often inaccurate.
For instance:
Each player has 2 cards left
Player A can have the AS, KS.
Player B can have the AS, KS, AH, and KH.
Algorithm 1 would correctly find that the chance Player B has the AS is 0.
Algorithm 2 would incorrectly find that the chance Player B has the AS is 0.5.
Is there a better algorithm that would be both reasonably accurate and reasonably fast?
Take a page from the book of quantum mechanics. Consider that every card is in a mix of states with probabilities - e.g. x|AS> + y|KS> + z|AH> + w|KH>. For 36 cards, you get a 36x36 matrix, where initially all values equal 1/36. The constraints are that the sum of all values in a row equals 1 (every card is somewhere) and the sum of all values in a column equals 1 (every card is something). For your mini-example, the initial matrix would be
0.25 0.25 0.25 0.25 (AS)
0.25 0.25 0.25 0.25 (KS)
0.25 0.25 0.25 0.25 (AH)
0.25 0.25 0.25 0.25 (KH)
(0) (1) (2) (3)
Let A's cards be positions (0), (1) and B's cards be positions (2), (3). The chance of B having the AS is 0.5.
Now suppose you observe that P(0 = AH) = 0. You set the corresponding element to 0 and proportionally rescale the row and column values, and then all other values, so that the sums remain 1:
0.33 0.22 0.22 0.22 (AS)
0.33 0.22 0.22 0.22 (KS)
0.00 0.33 0.33 0.33 (AH)
0.33 0.22 0.22 0.22 (KH)
(0) (1) (2) (3)
Adding the observations P(0 = KH) = 0, P(1 = AH) = 0, P(1 = KH) = 0 gets you this matrix:
0.50 0.50 0.00 0.00 (AS)
0.50 0.50 0.00 0.00 (KS)
0.00 0.00 0.50 0.50 (AH)
0.00 0.00 0.50 0.50 (KH)
(0) (1) (2) (3)
As you can see, P(2 = AS or 3 = AS) = 0, as it should be.
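One way to implement that rescaling is iterative proportional fitting (alternating row/column normalization, Sinkhorn-style). A minimal sketch in Python with NumPy; this is one interpretation of the answer's update rule, not necessarily the author's exact procedure:

import numpy as np

def renormalize(M, iters=50):
    # Alternately rescale rows and columns to sum to 1, leaving the
    # zeroed (impossible) entries at 0.
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)  # every card is somewhere
        M = M / M.sum(axis=0, keepdims=True)  # every position holds one card
    return M

# Rows: AS, KS, AH, KH; columns: positions (0), (1), (2), (3).
M = np.full((4, 4), 0.25)
for card in (2, 3):      # AH and KH ...
    for pos in (0, 1):   # ... cannot be in A's hand
        M[card, pos] = 0.0
print(renormalize(M).round(2))  # reproduces the 0.50/0.00 matrix above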
Note that most games allow a player to shuffle the cards in their hand (i.e. when B plays a card, you don't know whether it's (2) or (3)). Suppose A and B exchange cards (1) and (2) - this leaves the matrix the same - and then B shuffles his cards; the matrix becomes
0.50 0.25 0.00 0.25 (AS)
0.50 0.25 0.00 0.25 (KS)
0.00 0.25 0.50 0.25 (AH)
0.00 0.25 0.50 0.25 (KH)
(0) (1) (2) (3)
Also note that the model isn't perfect - it doesn't let you record observations like "B has either (AS, KH) or (AH, KS)". But for certain definitions of "reasonably accurate", it probably is.

How to transform a correlation matrix into a single row?

I have a 200x200 correlation matrix text file that I would like to turn into a single row.
e.g.
a b c d e
a 1.00 0.33 0.34 0.26 0.20
b 0.33 1.00 0.40 0.48 0.41
c 0.34 0.40 1.00 0.59 0.35
d 0.26 0.48 0.59 1.00 0.43
e 0.20 0.41 0.35 0.43 1.00
I want to turn it into:
a_b a_c a_d a_e b_c b_d b_e c_d c_e d_e
0.33 0.34 0.26 0.20 0.40 0.48 0.41 0.59 0.35 0.43
I need code that can:
1. Join the variable names to make a single row of headers (e.g. turn "a" and "b" into "a_b"), and
2. Turn only one half of the correlation matrix (the upper or lower triangle) into a single row.
A bit of extra information: I have around 500 participants in a study and each of them has a correlation matrix file. I want to consolidate these separate data files into one file where each row is one participant's correlation matrix.
Does anyone know how to do this?
Thanks!!
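One possible approach, sketched in Python with pandas (this assumes each participant's file is whitespace-delimited with row and column labels as in the example; all file names below are hypothetical):

import pandas as pd

def flatten_upper(path):
    # Read one labeled, square correlation matrix.
    m = pd.read_csv(path, sep=r"\s+", index_col=0)
    cols = list(m.columns)
    out = {}
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            out[f"{cols[i]}_{cols[j]}"] = m.iloc[i, j]  # e.g. "a_b"
    return pd.Series(out)

# One row per participant (names/paths are hypothetical).
paths = {"p001": "p001.txt", "p002": "p002.txt"}
combined = pd.DataFrame({p: flatten_upper(f) for p, f in paths.items()}).T
combined.to_csv("all_participants.csv")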

Hive percentile_approx function is broken, isn't it?

I am using Hive 1.2.1000.2.4.2.0-258.
There are 4,850,000+ rows in the table (14,511 of them with A between 73 and 74) and 3 columns: group_id, A, and B.
group_id is constant and equal to 0.
Almost all values of A and B are integers.
I used the following script to compute summary statistics from the table:
select group_id, --group_id=0 a constant
percentile_approx(A , 0.5) as A_mdn,
percentile_approx(A , 0.25) as A_Q1,
percentile_approx(A , 0.75) as A_Q3,
percentile_approx(A , array(0.2,0.15, 0.1,0.05,0.025,0.001)) as A_i,
min(A) as min_A,
percentile_approx(B , 0.5) as B_mdn,
percentile_approx(B , 0.25) as B_Q1,
percentile_approx(B , 0.75) as B_Q3,
percentile_approx(B , array(0.8,0.85, 0.9, 0.95,0.975)) as B_i
from table
group by group_id;
The result I got is:
group_id: 0
A_mdn:  73.21058033222496
A_Q1:   73.21058033222496
A_Q3:   462.16968382794516
A_i:    [73.21058033222496, 73.21058033222496, 73.21058033222496, 73.21058033222496, 73.21058033222496, 73.21058033222496]
min_A:  0.0
B_mdn:  1.0
B_Q1:   1.0
B_Q3:   2.0
B_i:    [2.0, 3.0, 4.0, 8.11278644563614, 17.0]
Then I changed the code as follows:
select group_id, --group_id=0 a constant
percentile(cast(A as bigint), 0.5) as A_mdn,
percentile(cast(A as bigint), 0.25) as A_Q1,
percentile(cast(A as bigint), 0.75) as A_Q3,
percentile(cast(A as bigint), array(0.2,0.15, 0.1,0.05,0.025,0.001)) as A_i,
min(A) as min_A,
percentile(cast(B as bigint), 0.5) as B_mdn,
percentile(cast(B as bigint), 0.25) as B_Q1,
percentile(cast(B as bigint), 0.75) as B_Q3,
percentile(cast(B as bigint), array(0.8,0.85, 0.9, 0.95,0.975)) as B_i
from table
group by group_id
The new result is:
group_id: 0
A_mdn:  72.0
A_Q1:   6.0
A_Q3:   762.0
A_i:    [3.0, 1.0, 1.0, 0.0, 0.0, 0.0]
min_A:  0.0
B_mdn:  1.0
B_Q1:   1.0
B_Q3:   2.0
B_i:    [2.0, 3.0, 4.0, 9.0, 17.0]
To double-check, I also loaded this table into R. Here is the R result:
A:
Min 0
Q1: 6
Median: 72
Q3: 762
0.2 quantile: 3
0.15 quantile: 1.5
0.1 quantile: 1
0.05 quantile: 0
0.025 quantile: 0
0.001 quantile: 0
B
Q1: 1
Median: 1
Q3: 2
0.8 quantile: 2
0.85 quantile: 3
0.9 quantile: 4
0.95 quantile: 9
0.975 quantile: 17
Obviously, the R result is consistent with the percentile function, but percentile_approx gives me the wrong answer.
Yeah, percentile_approx doesn't have any approximation guarantees, except when you set the accuracy to be greater than or equal to the number of data points.
The source for it is here: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
From a quick reading, the gist is that it creates accuracy buckets, and then when it runs out of buckets it merges buckets by finding the two closest buckets and combining them with a weighted sum.
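A toy sketch of that merging scheme in Python (illustrative only - this is one reading of the idea, not Hive's actual implementation):

def add_point(bins, x, max_bins):
    # bins is a list of [value, count] pairs kept sorted by value.
    bins.append([x, 1.0])
    bins.sort(key=lambda b: b[0])
    if len(bins) > max_bins:
        # Merge the two closest neighbouring bins into their weighted mean.
        i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (v1, c1), (v2, c2) = bins[i], bins[i + 1]
        bins[i:i + 2] = [[(v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2]]
    return bins

Once many nearby points have been merged, a bin's weighted mean stands in for all of them, so the reported percentiles come from bin centers rather than actual data values.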
This will break with various inputs, though. In particular, data points that are very high or very low and spaced far apart will break the algorithm. If you first clip your data to a range without many outliers, it should perform better.
If your data is too skewed, you might instead consider randomly sampling it and computing the exact percentile on the sample.
The percentile function returns the true value only when all the values are integers. You said that almost all of A and B are integers.
Try casting the complete column A to int and see if you come closer to the answer.
I don't think you will ever get exactly the same answer as R, because R's percentile function most likely handles non-integers as well.
One way to get the exact answer would be to write your own UDF and use it instead.
Hope this helps!

Multiple palettes and empty labels from file entries using matrix with image in gnuplot

I have a file with a 4x4 score matrix and I'd like to plot the upper triangle with one color palette and the lower triangle with a different one, overlaying the score values (MWE at the bottom).
The original file looks like this
0.00 0.65 0.65 0.25
0.25 0.00 0.75 0.25
0.50 0.60 0.00 0.25
0.75 0.25 0.10 0.00
First, I created two separate files and used multiplot to have 2 different palettes.
FILE1 (upper triangular)
0.00 0.65 0.65 0.25
nan 0.00 0.75 0.25
nan nan 0.00 0.25
nan nan nan 0.00
FILE2 (lower triangular)
0.00 nan nan nan
0.25 0.00 nan nan
0.50 0.60 0.00 nan
0.75 0.25 0.10 0.00
Second, I plot the score values with
using 1:2:( sprintf('%.2f', $3 ) )
However, 'nan' isn't interpreted as blank/empty and skipped; it is written onto the plot.
Any idea how to skip the nans and make gnuplot plot empty labels from individual entries of the data files?
The ternary operator used in the following fashion does not seem to do the job:
using 1:2:( $3 == 'nan' ? 1/0 : sprintf('%.2f', $3 ))
Thanks.
set multiplot
set autoscale fix
unset key
set datafile missing "nan"
set cbrange [0:1]
unset colorbox
set palette defined (0 "white", 0.1 "#9ecae1", 1.0 "#3182bd")
plot FILE1 matrix with image, \
FILE1 matrix using 1:2:( sprintf('%.2f', $3) ) with labels font ',16'
set palette defined (0 "white", 0.1 "#a1d99b", 1.0 "#31a354")
plot FILE2 matrix with image, \
FILE2 matrix using 1:2:( sprintf('%.2f', $3) ) with labels font ',16'
unset multiplot
You don't need multiplot and two separate files (I also couldn't get that variant working with the labels).
Just define a single palette whose negative range holds one palette and whose positive range holds the other. Based on the x- and y-values from the single file you showed first, you can then decide whether the color value should be taken from the negative or the positive part of the palette:
set autoscale fix
set cbrange [-1:1]
unset colorbox
unset key
set palette defined (-1.0 "#31a354", -0.1 "#a1d99b", 0 "white", 0.1 "#9ecae1", 1.0 "#3182bd")
plot 'FILE' matrix using 1:2:($1<$2 ? -$3 : $3) with image,\
'' matrix using 1:2:(sprintf('%.2f', $3)) with labels font ',16'
