I am trying to create a new column in my dataframe based on the extreme values across 3 columns. Depending on the values within each row, I want it to select either the most negative value or the most positive value. If the average for an individual row across the 3 columns is greater than 0, I want it to report the most positive value. If it is less than 0, I want it to report the most negative value.
Here is an example of the dataframe
A B C
-0.30 -0.45 -0.25
0.25 0.43 0.21
-0.10 0.10 0.25
-0.30 -0.10 0.05
And here is the desired output
A B C D
-0.30 -0.45 -0.25 -0.45
0.25 0.43 0.21 0.43
-0.10 0.10 0.25 0.25
-0.30 -0.10 0.05 -0.30
I had first tried playing around with something like
data %>%
mutate(D = pmax(abs(A), abs(B), abs(C)))
But that just returns a column with the greatest of the absolute values, with everything reported as positive.
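I'm wondering whether the right direction is to compare the row average with 0 and switch between pmax() and pmin(), something like the sketch below, but I'm not sure this is the right (or idiomatic) way to do it:

data %>%
  mutate(D = ifelse((A + B + C) / 3 > 0,
                    pmax(A, B, C),
                    pmin(A, B, C)))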
Thanks in advance for your help, and apologies if the formatting of the question is off, I don't use this site a lot. Happy to clarify anything as well.
In a trick-taking game, it is often easy to keep track of which cards each player can possibly have left. For instance, if following suit is mandatory and a player does not follow suit, it is obvious that the player has no more cards of that suit.
This means, during the game you can build up knowledge about which cards each player can possibly have.
Is there a way to efficiently calculate (a reasonably accurate) chance that a specific player actually has a certain card?
A naive way would be to just generate all permutations of all cards left and check which of these permutations are possible given the constraints mentioned earlier. But this is not really an efficient approach.
Another approach would be to just check how many others could have a particular card. For instance, if 3 players might have a particular card you could use 1/3 as the chance a particular player has a certain card. But this is often inaccurate.
For instance:
Each player has 2 cards left
Player A can have the AS, KS.
Player B can have the AS, KS, AH, and KH.
Algorithm 1 would correctly find that the chance Player B has the AS is 0.
Algorithm 2 would incorrectly find that the chance Player B has the AS is 0.5.
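To make Algorithm 1 concrete for this mini-example, a brute-force version (here sketched in R, purely for illustration) might look like this:

cards <- c("AS", "KS", "AH", "KH")

# Every assignment of the four cards to four hand slots (slots 1-2 = A, 3-4 = B),
# reduced to true permutations.
deals <- expand.grid(rep(list(cards), 4), stringsAsFactors = FALSE)
deals <- deals[apply(deals, 1, function(d) length(unique(d)) == 4), ]

# Constraint from the example: player A can only hold the AS or KS.
ok <- deals[deals$Var1 %in% c("AS", "KS") & deals$Var2 %in% c("AS", "KS"), ]

mean(ok$Var3 == "AS" | ok$Var4 == "AS")   # P(B has the AS) = 0

With a full deck and realistic hand sizes this enumeration obviously explodes, which is why I'm looking for something smarter.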
Is there a better algorithm that would be both reasonably accurate and reasonably fast?
Take a page from the book of quantum mechanics: consider that every card is in a mix of states with probabilities, e.g. x|AS> + y|KS> + z|AH> + w|KH>. For 36 cards, you get a 36 x 36 matrix where initially all values equal 1/36. The constraints are that the sum of all values in a row equals 1 (every card is somewhere) and the sum of all values in a column equals 1 (every position holds exactly one card). For your mini-example, the initial matrix would be
0.25 0.25 0.25 0.25 (AS)
0.25 0.25 0.25 0.25 (KS)
0.25 0.25 0.25 0.25 (AH)
0.25 0.25 0.25 0.25 (KH)
(0) (1) (2) (3)
Let A's cards be (0) and (1), and B's cards be (2) and (3). The chance of B having the AS is 0.5.
Now suppose you observe that P(0 = AH) = 0. You set the corresponding element to 0 and proportionally adjust the remaining values in that row and column, and then all other values, so that the sums remain 1:
0.33 0.22 0.22 0.22 (AS)
0.33 0.22 0.22 0.22 (KS)
0.00 0.33 0.33 0.33 (AH)
0.33 0.22 0.22 0.22 (KH)
(0) (1) (2) (3)
Adding observations P(0 = KH) = 0, P(1 = AH) = 0, P(1 = KH) = 0 gets you this matrix:
0.50 0.50 0.00 0.00 (AS)
0.50 0.50 0.00 0.00 (KS)
0.00 0.00 0.50 0.50 (AH)
0.00 0.00 0.50 0.50 (KH)
(0) (1) (2) (3)
As you can see, P(2 = AS or 3 = AS) = 0, as it should be.
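If it helps, here is a small R sketch of the same idea for the mini-example, reading "proportionally alter the values so that sums remain 1" as repeated row/column renormalisation (a Sinkhorn-style iteration). The card and slot labels are just those from above, and the iteration count is arbitrary:

M <- matrix(1/4, nrow = 4, ncol = 4,
            dimnames = list(c("AS", "KS", "AH", "KH"), c("0", "1", "2", "3")))

# Observations: slots 0 and 1 (player A) cannot hold the AH or KH.
M[c("AH", "KH"), c("0", "1")] <- 0

# Alternately rescale rows and columns until both sets of sums are ~1.
for (i in 1:200) {
  M <- M / rowSums(M)                 # every card is somewhere
  M <- sweep(M, 2, colSums(M), "/")   # every slot holds exactly one card
}
round(M, 2)
#     0   1   2   3
# AS 0.5 0.5 0.0 0.0
# KS 0.5 0.5 0.0 0.0
# AH 0.0 0.0 0.5 0.5
# KH 0.0 0.0 0.5 0.5

Run with only the first observation (P(0 = AH) = 0) zeroed out, the same loop reproduces the 0.33 / 0.22 matrix above.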
Note that most games allow a player to shuffle the cards in his or her hand (i.e. when B plays a card, you don't know whether it was (2) or (3)). Suppose A and B exchange cards (1) and (2) - this leaves the matrix the same - and then B shuffles his cards; the matrix becomes
0.50 0.25 0.00 0.25 (AS)
0.50 0.25 0.00 0.25 (KS)
0.00 0.25 0.50 0.25 (AH)
0.00 0.25 0.50 0.25 (KH)
(0) (1) (2) (3)
Also note that the model isn't perfect - it doesn't let you record observations like "B has either (AS, KH) or (AH, KS)". But for certain definitions of "reasonably accurate", it probably is.
I have a 200x200 correlation matrix text file that I would like to turn into a single row.
e.g.
a b c d e
a 1.00 0.33 0.34 0.26 0.20
b 0.33 1.00 0.40 0.48 0.41
c 0.34 0.40 1.00 0.59 0.35
d 0.26 0.48 0.59 1.00 0.43
e 0.20 0.41 0.35 0.43 1.00
I want to turn it into:
a_b a_c a_d a_e b_c b_d b_e c_d c_e d_e
0.33 0.34 0.26 0.20 0.40 0.48 0.41 0.59 0.35 0.43
I need code that can:
1. Join the variable names to make a single row of headers (e.g. turn "a" and "b" into "a_b") and
2. Turn only one half of the correlation matrix (bottom or top triangle) into a single row
A bit of extra information: I have around 500 participants in a study and each of them has a correlation matrix file. I want to consolidate these separate data files into one file where each row is one participant's correlation matrix.
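For a single participant, the logic I have in mind is roughly the sketch below (the file name is just a placeholder, and I'm assuming the file reads in cleanly with read.table), but I haven't managed to put the whole thing together, especially looping over all 500 files:

m <- as.matrix(read.table("participant1.txt", header = TRUE,
                          row.names = 1, check.names = FALSE))

nms  <- combn(colnames(m), 2, paste, collapse = "_")   # "a_b", "a_c", ..., "d_e"
vals <- t(m)[lower.tri(m)]                             # upper triangle, row by row
row1 <- setNames(vals, nms)

Then I imagine binding the 500 of these together with something like rbind, one row per participant.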
Does anyone know how to do this?
Thanks!!
I have been told that in order to calculate the expected residence time for a set of states I can use the following approach:
1. Construct the transition matrix of the Markov chain, with entry (i, j) being the probability of a transition from state i to state j.
2. Transpose the matrix, so that each column contains the inbound probabilities for that state.
3. Invert the diagonal, so that a value p becomes (1 - p).
4. Add a row at the bottom containing 1's.
5. Construct a coefficient vector of 0's with the last element 1.
6. Solve the system. The resulting vector should contain the expected residence times for the various states.
Let me give an example:
I have the initial Markov Chain:
0.25 ; 0.25 ; 0.25 ; 0.25
0.00 ; 0.50 ; 0.50 ; 0.00
0.33 ; 0.33 ; 0.33 ; 0.00
0.00 ; 0.00 ; 0.50 ; 0.50
After steps 1-3 it looks like this:
0.75 ; 0.00 ; 0.33 ; 0.00
0.25 ; 0.50 ; 0.33 ; 0.00
0.25 ; 0.50 ; 0.67 ; 0.50
0.25 ; 0.00 ; 0.00 ; 0.50
I add the last line:
0.75 ; 0.00 ; 0.33 ; 0.00
0.25 ; 0.50 ; 0.33 ; 0.00
0.25 ; 0.50 ; 0.67 ; 0.50
0.25 ; 0.00 ; 0.00 ; 0.50
1.00 ; 1.00 ; 1.00 ; 1.00
The coefficient vector will be the following:
0 ; 0 ; 0 ; 0 ; 1
The added line of 1's should enforce that the solution sums to 1. However, my solution is the set:
{0.42; 0.84; -0.79; 0.32}
This sums to 0.79, so clearly something is wrong.
I also note that the expected residence time of state 3 is negative, which in my mind should not be possible.
I have it implemented in Java and use Commons Math to handle the matrix calculations. I have tried the various algorithms described in the documentation, but I get the same result.
I have also tried replacing one of the rows with the line of 1's in order to make the matrix square. When I do that, I get the following set of solutions:
{0.79; 0.79; -1.79; 1.2}
Even though these probabilities sum to 1, they must still be wrong, as they should be in the range 0..1 AND sum to 1.
Is this an entirely wrong approach to the problem? Where am I off?
Unfortunately I am not very mathematical, but I hope I have given enough information for you to see the problem.
I found the answer:
In step 3, let all probabilities p except the diagonal become -p:
0.75 ; -0.00 ; -0.33 ; -0.00
-0.25 ; 0.50 ; -0.33 ; -0.00
-0.25 ; -0.50 ; 0.67 ; -0.50
-0.25 ; -0.00 ; -0.00 ; 0.50
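Written out as a linear system (here just a small R sketch rather than the Java/Commons Math code, with the 0.33 entries read as 1/3 so the rows sum exactly to 1), the corrected procedure gives a sensible answer:

P <- matrix(c(1/4, 1/4, 1/4, 1/4,
              0,   1/2, 1/2, 0,
              1/3, 1/3, 1/3, 0,
              0,   0,   1/2, 1/2),
            nrow = 4, byrow = TRUE)

A <- diag(4) - t(P)        # steps 2-3: transpose, 1-p on the diagonal, -p off it
A <- rbind(A, rep(1, 4))   # step 4: the appended row of 1's
b <- c(0, 0, 0, 0, 1)      # step 5: the coefficient vector

res <- qr.solve(A, b)      # step 6: least-squares solve of the (consistent) 5x4 system
round(res, 3)
# 0.174 0.348 0.391 0.087  (= 4/23, 8/23, 9/23, 2/23): non-negative and summing to 1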
I've begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique on matrices and data frames: it seems to run faster on a data frame.
a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time({
u1 = unique(a)
})
user system elapsed
1.840 0.000 1.846
system.time({
u2 = unique(b)
})
user system elapsed
0.380 0.000 0.379
The timing results diverge even more substantially as the number of rows is increased. So, there are two parts to this question.
1. Why is this slower for a matrix? It seems faster to convert to a data frame, run unique, and then convert back.
2. Is there any reason not to just wrap unique in myUnique, which does the conversions in part #1?
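For reference, the wrapper I have in mind for part 2 is just a thin round trip (sketch only):

myUnique <- function(m) {
  as.matrix(unique(as.data.frame(m)))
}
# Caveats: the round trip adds default column names (V1, V2, ...) if the matrix
# has none, and rows are still compared via a pasted string representation, as
# with unique.matrix itself.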
Note 1. Given that a matrix is atomic, it seems that unique should be faster for a matrix, rather than slower. Being able to iterate over fixed-size, contiguous blocks of memory should generally be faster than running over separate blocks of linked lists (I assume that's how data frames are implemented...).
Note 2. As demonstrated by the performance of data.table, running unique on a data frame or a matrix is a comparatively bad idea - see the answer by Matthew Dowle and the comments for relative timings. I've migrated a lot of objects to data tables, and this performance is another reason to do so. So although users would be well served by adopting data tables, for pedagogical / community reasons I'll leave the question open for now regarding why this takes longer on matrix objects. The answers below address where the time goes and how else we can get better performance (i.e. data tables). The answer to why is close at hand - the code can be found via unique.data.frame and unique.matrix. :) An English explanation of what it's doing & why is all that is lacking.
In this implementation, unique.matrix is the same as unique.array
> identical(unique.array, unique.matrix)
[1] TRUE
unique.array has to handle multi-dimensional arrays, which requires additional processing to 'collapse' the extra dimensions (those extra calls to paste()) that are not needed in the 2-dimensional case. The key section of code is:
collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)
temp <- if (collapse)
apply(x, MARGIN, function(x) paste(x, collapse = "\r"))
unique.data.frame is optimised for the 2D case; unique.matrix is not. It could be, as you suggest; it just isn't in the current implementation.
Note that in all cases (unique.{array,matrix,data.table}) where there is more than one dimension it is the string representation that is compared for uniqueness. For floating point numbers this means 15 decimal digits so
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))
is 1 while
NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))
and
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))
are both 2. Are you sure unique is what you want?
Not sure, but I guess that because a matrix is one contiguous vector, R copies it into column vectors first (like a data.frame) because paste() needs a list of vectors. Note that both are slow because both use paste().
Perhaps because unique.data.table is already many times faster. Please upgrade to v1.6.7 by downloading it from the R-Forge repository, because that has the fix to unique that you raised in this question. data.table doesn't use paste to do unique.
a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time(u1<-unique(a))
user system elapsed
2.98 0.00 2.99
system.time(u2<-unique(b))
user system elapsed
0.99 0.00 0.99
library(data.table)
c = as.data.table(b)
system.time(u3<-unique(c))
user system elapsed
0.03 0.02 0.05 # 60 times faster than u1, 20 times faster than u2
identical(as.data.table(u2),u3)
[1] TRUE
In attempting to answer my own question, especially part 1, we can see where the time is spent by looking at the results of Rprof. I ran this again, with 5M elements.
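The profiles were generated roughly like this (same construction as before, just scaled up to 5M elements, with the output written to the u1.txt and u2.txt files referenced below):

a = matrix(sample(2, 5 * 10^6, replace = TRUE), ncol = 10)
b = as.data.frame(a)

Rprof("u1.txt"); u1 = unique(a); Rprof(NULL)
Rprof("u2.txt"); u2 = unique(b); Rprof(NULL)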
Here are the results for the first unique operation (for the matrix):
> summaryRprof("u1.txt")
$by.self
self.time self.pct total.time total.pct
"paste" 5.70 52.58 5.96 54.98
"apply" 2.70 24.91 10.68 98.52
"FUN" 0.86 7.93 6.82 62.92
"lapply" 0.82 7.56 1.00 9.23
"list" 0.30 2.77 0.30 2.77
"!" 0.14 1.29 0.14 1.29
"c" 0.10 0.92 0.10 0.92
"unlist" 0.08 0.74 1.08 9.96
"aperm.default" 0.06 0.55 0.06 0.55
"is.null" 0.06 0.55 0.06 0.55
"duplicated.default" 0.02 0.18 0.02 0.18
$by.total
total.time total.pct self.time self.pct
"unique" 10.84 100.00 0.00 0.00
"unique.matrix" 10.84 100.00 0.00 0.00
"apply" 10.68 98.52 2.70 24.91
"FUN" 6.82 62.92 0.86 7.93
"paste" 5.96 54.98 5.70 52.58
"unlist" 1.08 9.96 0.08 0.74
"lapply" 1.00 9.23 0.82 7.56
"list" 0.30 2.77 0.30 2.77
"!" 0.14 1.29 0.14 1.29
"do.call" 0.14 1.29 0.00 0.00
"c" 0.10 0.92 0.10 0.92
"aperm.default" 0.06 0.55 0.06 0.55
"is.null" 0.06 0.55 0.06 0.55
"aperm" 0.06 0.55 0.00 0.00
"duplicated.default" 0.02 0.18 0.02 0.18
$sample.interval
[1] 0.02
$sampling.time
[1] 10.84
And for the data frame:
> summaryRprof("u2.txt")
$by.self
self.time self.pct total.time total.pct
"paste" 1.72 94.51 1.72 94.51
"[.data.frame" 0.06 3.30 1.82 100.00
"duplicated.default" 0.04 2.20 0.04 2.20
$by.total
total.time total.pct self.time self.pct
"[.data.frame" 1.82 100.00 0.06 3.30
"[" 1.82 100.00 0.00 0.00
"unique" 1.82 100.00 0.00 0.00
"unique.data.frame" 1.82 100.00 0.00 0.00
"duplicated" 1.76 96.70 0.00 0.00
"duplicated.data.frame" 1.76 96.70 0.00 0.00
"paste" 1.72 94.51 1.72 94.51
"do.call" 1.72 94.51 0.00 0.00
"duplicated.default" 0.04 2.20 0.04 2.20
$sample.interval
[1] 0.02
$sampling.time
[1] 1.82
What we notice is that the matrix version spends a lot of time on apply, paste, and lapply. In contrast, the data frame version simply runs duplicated.data.frame, and most of the time is spent in paste, presumably collapsing each row into a single string for comparison.
Although this explains where the time is going, it doesn't explain why these have different implementations, nor the effects of simply changing from one object type to another.