Subsetting multiple variables by rank within data-frame - rstudio

Hopefully the below makes sense.
I have a data set with a large number of variables (rows). Within each variable are sections that are scored 1-15. I need to subset the data frame to the three highest-scoring sections for each variable. Each section has additional data associated with it that needs to be kept, but that data plays no part in the selection.
Having trouble with this. Any help is appreciated.
Dummy layout below
Variable Aux_score
1 1
1 6
1 3
1 8
1 10
2 3
2 2
2 12
2 10
2 11
3 7
3 2
3 9
3 8
3 12

You can do it like this with base R:
do.call(rbind, lapply(split(df, df$Variable), function(df) df[ tail(order(df$Aux_score), 3), ]))
Or like this with tidyverse:
df %>% group_by(Variable) %>% top_n(3, Aux_score) %>% ungroup()
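For readers who want the logic spelled out, here is a language-neutral sketch in Python of what both answers do: split the rows by Variable, sort each group by Aux_score, and keep the three highest. The function name top_n_per_group and the tuple layout are illustrative assumptions, not part of either R answer. Like the base R version, it keeps exactly three rows per group even when scores are tied, whereas top_n can return more rows on ties.

```python
from collections import defaultdict

# Dummy data from the question as (Variable, Aux_score) pairs.
rows = [
    (1, 1), (1, 6), (1, 3), (1, 8), (1, 10),
    (2, 3), (2, 2), (2, 12), (2, 10), (2, 11),
    (3, 7), (3, 2), (3, 9), (3, 8), (3, 12),
]

def top_n_per_group(rows, n=3):
    """Keep the n highest-scoring rows within each Variable (hypothetical helper)."""
    groups = defaultdict(list)
    for variable, score in rows:
        groups[variable].append((variable, score))
    result = []
    for variable in sorted(groups):
        # Sort each group by score, highest first, and keep the first n rows.
        best = sorted(groups[variable], key=lambda r: r[1], reverse=True)[:n]
        result.extend(best)
    return result

top3 = top_n_per_group(rows)
```

Any additional columns would simply ride along in each tuple, since only the score is used for the selection.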

Related

Algorithm to distribute evenly products value into care packages

I'm currently solving a problem that states:
A company has filed for bankruptcy and decides to pay its employees with its last remaining valuable items, but only if the items can be distributed evenly among them: every employee must receive at least 1 item, and the difference between the employee carrying the most valuable items and the employee carrying the least valuable items cannot exceed a certain value x.
Input:
First row contains the number of employees;
Second row contains the value x, the maximum allowed difference between the employee carrying the most valuable items and the employee carrying the least valuable items;
Third row contains all the items with their values;
Output:
The first number is the value of the least valuable basket of items and the second is the value of the most valuable basket;
Example:
Input:
5
4
2 5 3 11 4 3 1 15 7 8 10
Output:
13 15
Input:
5
4
1 1 1 11 1 3 1 2 7 8
Output:
NO (It's impossible to distribute evenly)
Input:
5
10
1 1 1 1
Output:
NO (It's impossible to distribute evenly)
My approach, using the first input as an example, is to sort the items in ascending or descending order, so from
2 5 3 11 4 3 1 15 7 8 10 --> 1 2 3 3 4 5 7 8 10 11 15
Then I create an adjacency list (or just store it in simple variables) and, while iterating over the item values from largest to smallest, add each value to the basket with the lowest sum so far:
Element 0: 15
Element 1: 11 <- 3 (sum 14)
Element 2: 10 <- 3 (sum 13)
Element 3: 8 <- 4 <- 1 (sum 13)
Element 4: 7 <- 5 <- 2 (sum 14)
So my solution would run in O(n log n + 2n): first a merge sort, then finding the max and min values. What do you think about this solution?
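A minimal Python sketch of the greedy idea described above, assuming a min-heap is used to find the lowest basket (the function name distribute and the heap are assumptions, not part of the original post). Note that this greedy strategy is a heuristic: it can report NO for inputs where some other distribution would satisfy x, so it should be read as an illustration of the approach rather than a proven solution. With a heap the distribution pass costs O(n log k) for k employees rather than O(n).

```python
import heapq

def distribute(employees, x, items):
    """Greedy sketch: give each item, largest first, to the lowest-valued basket.

    Returns (min_basket, max_basket), or None when the greedy result
    does not satisfy the constraints (printed as "NO" in the problem).
    Assumes all item values are positive.
    """
    if len(items) < employees:
        return None  # someone would end up with no item at all
    baskets = [0] * employees
    heapq.heapify(baskets)
    for value in sorted(items, reverse=True):
        lowest = heapq.heappop(baskets)      # current least valuable basket
        heapq.heappush(baskets, lowest + value)
    low, high = min(baskets), max(baskets)
    return (low, high) if high - low <= x else None

first = distribute(5, 4, [2, 5, 3, 11, 4, 3, 1, 15, 7, 8, 10])
second = distribute(5, 4, [1, 1, 1, 11, 1, 3, 1, 2, 7, 8])
third = distribute(5, 10, [1, 1, 1, 1])
```

Because the baskets start at zero and the values are positive, the first items always go to the still-empty baskets, which takes care of the "at least 1 item" rule.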

Ranking over each matrix column's sort in julia

I have a matrix (m) of scores for 4 students on 3 different exams.
4 3 1
3 2 5
8 4 6
1 5 2
I want to know, for each student, the exams they did best to worse on. Desired output:
1 2 3
2 3 1
1 3 2
3 1 2
Now, I'm new to the language (and coding in general), so I read GeeksforGeeks' page on sorting in Julia and tried
mapslices(sortperm, -m; dims = 2)
However, this gives something subtly different: a matrix in which each row holds the sorting indices of that row.
1 2 3
3 1 2
1 3 2
2 3 1
Perhaps it was obvious, but I now realize this is not actually what I want, but I cannot find a built-in function/fast way to complete this operation. Any ideas? Preferably something which doesn't iterate through items in the matrix/row, as in reality my matrix is very, very large. Thanks!
Such functionality is provided by StatsBase.jl. Here is an example:
julia> using StatsBase
julia> m = [4 3 1
3 2 5
8 4 6
1 5 2]
4×3 Array{Int64,2}:
4 3 1
3 2 5
8 4 6
1 5 2
julia> mapslices(x -> ordinalrank(x, rev=true), m, dims = 2)
4×3 Array{Int64,2}:
1 2 3
2 3 1
1 3 2
3 1 2
You might want to use a different ranking function, depending on how you want to split ties, see here for details.
Figured out something which works!
Run m_index_rank = mapslices(sortperm, -m; dims = 2) on the matrix and get a ranking for each row through index. Then, realizing this is, in each row, an inverse permutation away from the desired output, run mapslices(invperm, m_index_rank; dims = 2) for the desired result.
In one line, this is mapslices(r -> invperm(sortperm(r, rev=true)), m; dims=2) over the desired matrix m. dims = 2 is to carry out the operation row-wise.
I'm marking this resolved for now, but please let me know if there are cleaner/faster ways to do this.
Edit: Replaced my syntactically clunky mapslices(invperm, mapslices(sortperm, -m; dims = 2); dims = 2) with a more natural one, thanks to @phipsgabler
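The relationship used in this answer, that a row's ranking is the inverse permutation of its sort order, can be checked with a small sketch. It is written in Python, with 1-based ranks to match the Julia output; ranks_desc is a hypothetical helper, not a library function, and ties are broken by position, as with an ordinal rank.

```python
def ranks_desc(row):
    """Rank the entries of row from best (1) to worst, ties broken by position."""
    # Sort order: indices of the row from highest to lowest score
    # (the analogue of sortperm(row, rev=true)).
    order = sorted(range(len(row)), key=lambda i: -row[i])
    # Inverse permutation: the entry at `index` sits at `position` in the order
    # (the analogue of invperm).
    ranks = [0] * len(row)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

m = [
    [4, 3, 1],
    [3, 2, 5],
    [8, 4, 6],
    [1, 5, 2],
]
result = [ranks_desc(row) for row in m]
```

For the second row, the sort order is (3, 1, 2) while the ranks are (2, 3, 1), which is exactly the difference between the first attempt and the desired output.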

Making a scatterplot with PCA and how to read results

I'm fairly new to R and not familiar with PCA. From a survey, I have a list of observations on nine variables: the first is the gender of the respondents, the next five (Q51_1_c, Q51_2_c, Q51_4_c, Q51_6_c, Q51_7_c) ask about entrepreneurial issues, and the others (Q56_1_c, Q56_2_c, Q56_3_c) ask about future expectations. Except for gender, all of these variables take values between 1 and 5. I want to make a scatter plot with two axes, the first for the "entrepreneurial variables" and the second for the "future expectations variables", and then plot the positions of Male and Female as points. My data look like this:
x <- "Q1b Q51_1_c Q51_2_c Q51_4_c Q51_6_c Q51_7_c Q56_1_c Q56_2_c Q56_3_c
3 Male 5 4 4 4 4 5 4 4
4 Female 4 3 4 4 3 3 4 3
5 Female 1 1 1 1 1 3 1 1
7 Female 2 1 1 1 1 5 1 4
8 Female 4 4 5 4 4 5 4 4
9 Female 3 3 4 4 3 3 4 4
13 Male 4 4 4 4 5 3 3 3
15 Female 3 4 4 4 4 1 1 5
16 Female 4 1 4 4 4 3 3 3
19 Female 3 2 3 3 3 3 3 3
20 Male 1 1 1 1 1 3 1 5
21 Female 3 1 1 2 1 3 3 3
26 Female 5 5 1 2 1 4 4 3
27 Female 2 1 1 1 1 1 1 1
29 Male 2 2 2 2 1 4 4 4
31 Female 3 1 1 1 1 5 2 3
34 Female 4 1 1 4 3 3 1 4
36 Female 5 1 1 4 4 5 1 2
37 Male 5 1 2 4 4 5 4 5
38 Female 3 1 1 1 1 1 1 1"
To run PCA this is my code:
x <- read.table(text = x, header = TRUE) # Parse the text block above into a data frame
x <- na.omit(x) # Just to simplify
resul <- prcomp(x[,-1], scale = TRUE)
x$PC1 <- resul$x[,1] #Saving Scores PC1
x$PC2 <- resul$x[,2] #Saving Scores PC2
The resulting axes look like this:
biplot(resul, scale = 0)
Finally, to make the scatter plot:
x %>%
group_by(Q1b) %>%
summarise(mean_PC1 = mean(PC1),
mean_PC2 = mean(PC2)) %>%
ggplot(aes(x=mean_PC1, y=mean_PC2, colour=Q1b)) +
geom_point() +
theme_bw()
Which gives me this:
I'm not sure how to read the results. Should I conclude that Females in general score higher on the future expectations dimension than Males, and that Males score higher on the entrepreneurial dimension?
Thanks in advance!!
Your interpretation of the axes looks correct, i.e., PC1 is a gradient which from left to right represents decreasing "entrepreneurialness", while PC2 is a gradient which from bottom to top represents increasing future expectations (assuming that "5" in the original data means highest entrepreneurialness/expectations).
In terms of whether males and females are different, you probably need to plot more than just the means for each group: even if males and females are truly identical in their entrepreneurialness/expectations, you'd never expect the means from two samples to sit right on top of each other on a scatter plot. To address this, you could plot the actual observations rather than their means (i.e., one point per row, coloured by gender) and see if they intermingle vs. separate in the plot space. Or, regress gender against the principal components.
Another issue is whether it's appropriate to use PCA on ordinal data - see here for discussion.

openblas sgemv CblasRowMajor implementation returns wrong results (cblas_sgemv)

I ran some tests using cblas_sgemv in OpenBLAS and found that it returned a wrong result in my test case.
A is
1 2
3 4
5 6
B is
1 2
The output C should be 5 11 17
But, it outputs 5 14 0
Here is the sample code.
https://docs.google.com/document/d/15mCkfcQuruQxi4CjvVkoK2jfgnG2w3izd0wMFMW6UOk/edit?usp=sharing
The lda parameter seems to be wrong. Since the order is CblasRowMajor, it should be 2 (the number of columns) instead of 3 (the number of rows).
cf. https://stackoverflow.com/a/30208420/6058571
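To see why lda matters, here is a pure-Python model of row-major addressing, in which element (i, j) of A is read from position i * lda + j of the flat array. It is only a sketch of the indexing, not a call into OpenBLAS; the function name gemv_row_major is hypothetical, and the two padding zeros are an assumption standing in for whatever memory happens to lie past the end of the array.

```python
def gemv_row_major(m, n, a, lda, x):
    """Compute y = A * x for an m-by-n matrix stored row by row in the flat list a.

    lda is the distance (in elements) between the starts of consecutive rows.
    """
    return [sum(a[i * lda + j] * x[j] for j in range(n)) for i in range(m)]

a = [1, 2, 3, 4, 5, 6]   # the 3x2 matrix A, stored row by row
x = [1, 2]               # the vector B from the question

# lda = 2 (number of columns): rows start at offsets 0, 2, 4.
correct = gemv_row_major(3, 2, a, 2, x)

# lda = 3 (number of rows): rows start at offsets 0, 3, 6, so the second
# row reads (4, 5) and the third row reads past the end of the data.
padded = a + [0, 0]      # assumption: zeros model the out-of-bounds memory
wrong = gemv_row_major(3, 2, padded, 3, x)
```

Under these assumptions the wrong stride reproduces the 5 14 0 reported in the question, while lda = 2 gives the expected 5 11 17.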

Randomization of treatments in R studio

I want to get a randomization of treatments with three levels and sample size n = 15. I'm stuck at this point:
volunteers <- 1:15
set.seed(1); sample(volunteers, size=5, replace=F)
I want three different groups, five each, but I'm new to R.
This is a data setup for an ANOVA, not a specific question that comes with a particular data set. Also, I don't know what set.seed does.
I think you are looking for something like this:
set.seed(1337)
# replace with your real participant IDs
volunteers <- 1:15
# set the number of groups
number.of.groups <- 1:3
# set group size
group.size <- 5
# generate data frame with participant > group order
df <- data.frame(group=sort(rep(number.of.groups,group.size)),
participant=sample(volunteers,length(volunteers)))
# show your groups
df[which(df$group==1),]
# group participant
# 1 1 9
# 2 1 8
# 3 1 1
# 4 1 6
# 5 1 5
df[which(df$group==2),]
# group participant
# 6 2 4
# 7 2 15
# 8 2 3
# 9 2 2
# 10 2 13
df[which(df$group==3),]
# group participant
# 11 3 11
# 12 3 10
# 13 3 14
# 14 3 12
# 15 3 7
You only need set.seed() if you want to be able to reproduce your samples, since it makes R draw the same "random" samples every time. Consequently, set.seed() is more for testing than for real analysis code. The particular seed value is irrelevant, by the way. If you want to reproduce a result, just make sure to always set the same seed.
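The same idea, shuffle the participants once and cut the shuffled list into equal blocks, can be sketched outside R as well. The Python version below is an illustration only: assign_groups is a hypothetical helper, it assumes the number of volunteers is divisible by the number of groups, and the seed 1337 will not reproduce the R output above because the two languages use different random number streams.

```python
import random

def assign_groups(volunteers, n_groups, seed=None):
    """Shuffle the volunteers and split them into n_groups equal-sized groups.

    Passing the same seed always yields the same assignment; passing
    seed=None gives a different assignment on each call.
    """
    rng = random.Random(seed)
    shuffled = list(volunteers)
    rng.shuffle(shuffled)
    size = len(shuffled) // n_groups   # assumes an even split is possible
    return {g + 1: shuffled[g * size:(g + 1) * size] for g in range(n_groups)}

groups = assign_groups(range(1, 16), 3, seed=1337)
repeat = assign_groups(range(1, 16), 3, seed=1337)
```

Here groups and repeat are identical because they share a seed, which is the reproducibility that set.seed() provides in R.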
How about:
install.packages("randomizr")
library(randomizr)
Z <- complete_ra(15, num_arms = 3)
table(Z)
This gives
> table(Z)
Z
T1 T2 T3
5 5 5
