How to subtract elements in a DataFrame - SparkR

In SparkR I have a DataFrame data that contains id, amount_spent and amount_won.
For example, for id=1 we have
head(filter(data, data$id==1))
and the output is
1 30 10
1 40 100
1 22 80
1 14 2
So far I want to know whether a fixed id has more wins than losses. The amounts themselves can be ignored.
In R I can make it run, but it takes time. Say we have 100 ids. In R I have done this:
w <- c()
for (j in 1:100) {
  # Making it local for a fixed id
  q <- collect(filter(data, data$id == j))
  # Checking the difference: 1 means wins and 0 means losses
  if (sum(as.numeric(q$amount_won)) - sum(as.numeric(q$amount_spent)) > 0) {
    w[j] <- 1
  } else {
    w[j] <- 0
  }
}
Now w simply gives me 1's and 0's for all the ids. In SparkR I want to do this in a faster way.

I am not sure whether this is exactly what you want, so feel free to ask for adjustments.
df <- data.frame(id = c(1, 1, 1, 1),
                 amount_spent = c(30, 40, 22, 14),
                 amount_won = c(10, 100, 80, 2))
DF <- createDataFrame(sqlContext, df)
DF <- withColumn(DF, "won", DF$amount_won > DF$amount_spent)
DF$won <- cast(DF$won, "integer")
grouped <- groupBy(DF, DF$id)
aggregated <- agg(grouped, total_won = sum(DF$won), total_games = n(DF$won))
result <- withColumn(aggregated, "percentage_won", aggregated$total_won / aggregated$total_games)
collect(result)
I have added a column to DF indicating whether the id won more than it spent on that row. The result then contains, for each id, the number of games played, the number of games won, and the percentage of games won.
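If you also want the binary w vector from the question, one option (a sketch of mine, assuming "more wins than losses" means winning in more than half of the games) is to threshold percentage_won and cast it to an integer:
# Hypothetical follow-up, not part of the original answer
result <- withColumn(result, "more_wins",
                     cast(result$percentage_won > 0.5, "integer"))
head(select(result, "id", "more_wins"))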

Related

How to speed up row-specific operation based on values of other variables

Say I have this data:
sysuse auto2, clear
keep if _n<=4
describe
local N = r(N)
gen a1 = price
gen a2 = mpg
gen a3 = headroom
gen a4 = trunk
gen a5 = weight
gen a6 = length
input yearA yearB
1 4
1 5
2 5
1 6
end
keep a1-a6 yearA yearB
I'd like to do a row-specific operation based on the values of other variables. As an example, I'd like to add up all the a* columns selected by some row-specific rule, in this case those starting a year after yearA and ending a year before yearB. So, if yearA==1 and yearB==5, the starting year is 2 and the end year is 4, so we would add a2, a3, and a4 together to get that row's total. Each row has its own rule corresponding to (a function of) its values of yearA and yearB.
I came up with the following solution, which works, but it is clunky and slow:
gen total = .
forvalues i = 1/`N' {
    local start = yearA[`i'] + 1
    local end = yearB[`i'] - 1
    display "`start' `end'"
    * annoyingly, you can't replace with egen, so create a new variable and delete it
    egen total`i' = rowtotal(a`start'-a`end')
    replace total = total`i' if _n == `i'
    drop total`i'
}
As noted in the comment in the loop, I resorted to creating a new variable for each row and deleting it after using its value. Why? Because it doesn't seem like one can use replace with egen.
The actual application creates multiple variables and there are millions of observations, so it takes many hours or even days to run. What is a faster way to accomplish my goal? I am in no way tied to doing things row-by-row if there is a better way.
gen wanted = 0
forval j = 1/6 {
    replace wanted = wanted + a`j' if inrange(`j', yearA + 1, yearB - 1)
}
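Since the surrounding questions here are R-based, a hypothetical vectorised R analogue of the same row-specific window sum might look like this (my sketch, not from the original answer; the data frame df and its columns are made up for illustration):
# Hypothetical R analogue of the Stata loop above
df <- data.frame(matrix(rnorm(24), nrow = 4, dimnames = list(NULL, paste0("a", 1:6))),
                 yearA = c(1, 1, 2, 1), yearB = c(4, 5, 5, 6))
a <- as.matrix(df[paste0("a", 1:6)])
j <- col(a)  # column index of every cell
# keep only the columns between yearA+1 and yearB-1 in each row, then sum across the row
df$total <- rowSums(a * (j >= df$yearA + 1 & j <= df$yearB - 1))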

Python: break up dataframe (one row per entry in column, instead of multiple entries in column)

I have a solution to a problem, that to my despair is somewhat slow, and I am seeking advice on how to speed up my solution (by adding vectorization or other clever methods). I have a dataframe that looks like this:
import pandas as pd

toy = pd.DataFrame([[1, 'cv', 'c,d,e'], [2, 'search', 'a,b,c,d,e'], [3, 'cv', 'd']],
                   columns=['id', 'ch', 'kw'])
Output is:
   id      ch         kw
0   1      cv      c,d,e
1   2  search  a,b,c,d,e
2   3      cv          d
The task is to break up the kw column into one (replicated) row per comma-separated entry in each string. What I wish to achieve is the data frame printed at the end of the answer below.
My initial solution is the following:
data = pd.DataFrame()
for x in toy.itertuples():
    id = x.id; ch = x.ch; keys = x.kw.split(",")
    data = data.append([[id, ch, k] for k in keys], ignore_index=True)
data.columns = ['id', 'ch', 'kw']
Problem is: it is slow for larger dataframes. My hope is that someone has encountered a similar problem before, and knows how to optimize my solution. I'm using python 3.4.x and pandas 0.19+ if that is of importance.
Thank you!
You can use str.split to turn each kw string into a list, then str.len to get each list's length.
Then create the new DataFrame with the constructor, using numpy.repeat and numpy.concatenate:
import numpy as np

cols = toy.columns
splitted = toy['kw'].str.split(',')
l = splitted.str.len()
toy = pd.DataFrame({'id': np.repeat(toy['id'], l),
                    'ch': np.repeat(toy['ch'], l),
                    'kw': np.concatenate(splitted)})
toy = toy.reindex_axis(cols, axis=1)
print(toy)
   id      ch kw
0   1      cv  c
0   1      cv  d
0   1      cv  e
1   2  search  a
1   2  search  b
1   2  search  c
1   2  search  d
1   2  search  e
2   3      cv  d
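For completeness, a hypothetical base R analogue of the same split-into-rows operation (my sketch, not from the original answer), mirroring the repeat-and-concatenate idea:
# Hypothetical R analogue of the pandas answer above
toy <- data.frame(id = 1:3, ch = c("cv", "search", "cv"),
                  kw = c("c,d,e", "a,b,c,d,e", "d"), stringsAsFactors = FALSE)
keys <- strsplit(toy$kw, ",", fixed = TRUE)
out <- data.frame(id = rep(toy$id, lengths(keys)),
                  ch = rep(toy$ch, lengths(keys)),
                  kw = unlist(keys), stringsAsFactors = FALSE)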

R code: Efficiency Issue

I have a vector of length 14 and I would like to check in sets of 5 in this manner:
compare = c(rep(1, 4), rep(0, 10))  # vector of length 14
g.test = matrix(0, 5, 10)
for (i in 5:14) {
  g.test[, i - 4] = head(tail(compare, i), 5)
}
if (sum(colSums(g.test) >= 3 & colSums(g.test) < 5) > 0) {yield = TRUE}
I am running through the vector from compare[10:14] to compare[9:13] down to compare[1:5], checking whether any of these windows has a sum >= 3 and < 5.
But compare is just one such vector; I have 100,000 such vectors of different permutations of 1's and 0's, all of length 14. Running my code like that took my computer 100 seconds. Is there a better way to do this?
I'm actually running a simulation for Texas hold'em poker. This portion of the code is used to check for incomplete straight draws.
Try this:
g.sums <- rowSums(embed(compare, 5))
yield <- any(g.sums >= 3 & g.sums < 5)
100,000 iterations on my machine:
# user system elapsed
# 2.438 0.052 2.493
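If the 100,000 vectors can be stacked as rows of a 0/1 matrix (an assumption of mine, not stated in the answer), the check can also be vectorised across all vectors at once; a sketch:
# Sketch assuming the 100,000 length-14 vectors are the rows of a 0/1 matrix M
set.seed(1)
M <- matrix(rbinom(100000 * 14, 1, 0.3), ncol = 14)
# window sums for the 10 length-5 windows of every row at once
window_sums <- sapply(1:10, function(i) rowSums(M[, i:(i + 4)]))
# TRUE for every row that has at least one window with sum >= 3 and < 5
yield_all <- rowSums(window_sums >= 3 & window_sums < 5) > 0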

What is an efficient way to convert sets to a column index in R?

Overview
Given a large (5,000,000+ rows) data frame, A, with string row names, and a list of disjoint sets (n = 20,000+), B, where each set consists of row names from A, what is the best way to create a column that assigns each row of A the index of the set in B it belongs to?
Illustration
Below is an example illustrating this problem:
# Input
A <- data.frame(d = rep("A", 5e6), row.names = as.character(sample(1:5e6)))
B <- list(c("4655297", "3177816", "3328423"), c("2911946", "2829484"), ...) # Size 20,000+
The desired result would be:
# An index of NA represents that the row is not part of any set in B.
> A[,"index", drop = F]
        d index
4655297 A     1
3328423 A     1
2911946 A     2
2829484 A     2
3871770 A    NA
2702914 A    NA
2581677 A    NA
4106410 A    NA
3755846 A    NA
3177816 A     1
Naive Attempt
Something like this can be achieved using the following method.
n <- 0
A$index <- NA
lapply(B, function(x) {
  n <<- n + 1
  A[x, "index"] <<- n
})
Problem
However, this is unreasonably slow (several hours), because it indexes A many times, and it is not very R-esque or elegant.
How can the desired result be generated in a quick and efficient manner?
Here is a suggestion using base R that isn't too bad compared to your current method.
Sample data:
A <- data.frame(d = rep("A", 5e6),
                set = sample(c(NA, 1:20000), 5e6, replace = TRUE),
                row.names = as.character(sample(1:5e6)))
B <- split(rownames(A), A$set)
Base method:
system.time({
  A$index <- NA
  A[unlist(B), "index"] <- rep(seq_along(B), times = lengths(B))
})
# user system elapsed
# 15.30 0.19 15.50
Check:
identical(A$set, A$index)
# TRUE
For anything faster, I suppose data.table will come in handy.
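A sketch of that data.table idea (mine, not from the answer): build a lookup table of row names and join it onto rownames(A), so A itself is never indexed by row name.
library(data.table)
lookup <- data.table(rn = unlist(B, use.names = FALSE),
                     index = rep(seq_along(B), times = lengths(B)))
matched <- lookup[.(rownames(A)), on = "rn"]  # index is NA where a row name is in no set
A$index <- matched$index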

r: for loop operation with nested indices runs super slow

I have an operation I'd like to run for each row of a data frame, changing one column. I'm an apply/ddply/sqldf man, but I'll use loops when they make sense, and I think this is one of those times. This case is tricky because the column to change depends on information that changes by row; depending on information in one cell, I should make a change to only one of ten other cells in that row. With 75 columns and 20000 rows, the operation takes 10 minutes, when every other operation in my script takes 0-5 seconds, ten seconds max. I've stripped my problem down to the very simple test case below.
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol = 10, nrow = n))
system.time(
  for (i in 1:nrow(t.df)) {
    t.df[i, (t.df[i, 1] %% 10 + 1)] <- 99
  }
)
This takes 70 seconds with ten columns, and 360 when ncol=50. That's crazy. Are loops the wrong approach? Is there a better, more efficient way to do this?
I already tried initializing the nested term (t.df[i,1]%%10 + 1) as a list outside the for loop. It saves about 30 seconds (out of 10 minutes) but makes the example code above more complicated. So it helps, but it's not the solution.
My current best idea came while preparing this test case. For me, only 10 of the columns are relevant (and 75-11 columns are irrelevant). Since the run times depend so much on the number of columns, I can just run the above operation on a data frame that excludes irrelevant columns. That will get me down to just over a minute. But is "for loop with nested indices" even the best way to think about my problem?
It seems the real bottleneck is having the data in the form of a data.frame. I assume that in your real problem you have a compelling reason to use a data.frame. Any way to convert your data in such a way that it can remain in a matrix?
By the way, great question and a very good example.
Here's an illustration of how much faster loops are on matrices than on data.frames:
> n <- 20000
> t.df <- (matrix(1:5000, ncol=10, nrow=n) )
> system.time(
+ for (i in 1:nrow(t.df)) {
+ t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+ }
+ )
user system elapsed
0.084 0.001 0.084
>
> n <- 20000
> t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
> system.time(
+ for (i in 1:nrow(t.df)) {
+ t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+ }
+ )
user system elapsed
31.543 57.664 89.224
Using row and col seems less complicated to me:
t.df[col(t.df) == (row(t.df) %% 10) + 1] <- 99
I think Tommy's is still faster, but using row and col might be easier to understand.
@JD Long is right that if t.df can be represented as a matrix, things will be much faster.
...And then you can actually vectorize the whole thing so that it is lightning fast:
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol = 10, nrow = n))
system.time({
  m <- as.matrix(t.df)
  m[cbind(seq_len(nrow(m)), m[, 1] %% 10L + 1L)] <- 99
  t2.df <- as.data.frame(m)
}) # 0.00 secs
Unfortunately, the matrix indexing I use here does not seem to work on a data.frame.
EDIT
A variant where I create a logical matrix to index works on data.frame, and is almost as fast:
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol = 10, nrow = n))
system.time({
  t2.df <- t.df
  # Create a logical matrix with TRUE wherever the replacement should happen
  m <- array(FALSE, dim = dim(t2.df))
  m[cbind(seq_len(nrow(t2.df)), t2.df[, 1] %% 10L + 1L)] <- TRUE
  t2.df[m] <- 99
}) # 0.01 secs
UPDATE: Added the matrix version of Tommy's solution to the benchmarking exercise.
You can vectorize it. Here is my solution and a comparison with the loop
n <- 20000
t.df <- matrix(1:5000, ncol = 10, nrow = n)

f_ramnath <- function(x) {
  idx <- x[, 1] %% 10 + 1
  x[cbind(1:NROW(x), idx)] <- 99
  return(x)
}

f_long <- function(t.df) {
  for (i in 1:nrow(t.df)) {
    t.df[i, (t.df[i, 1] %% 10 + 1)] <- 99
  }
  return(t.df)
}

f_joran <- function(t.df) {
  t.df[col(t.df) == (row(t.df) %% 10) + 1] <- 99
  return(t.df)
}

f_tommy <- function(t.df) {
  t2.df <- t.df
  # Create a logical matrix with TRUE wherever the replacement should happen
  m <- array(FALSE, dim = dim(t2.df))
  m[cbind(seq_len(nrow(t2.df)), t2.df[, 1] %% 10L + 1L)] <- TRUE
  t2.df[m] <- 99
  return(t2.df)
}

f_tommy_mat <- function(m) {
  m[cbind(seq_len(nrow(m)), m[, 1] %% 10L + 1L)] <- 99
}
To compare the performance of the different approaches, we can use rbenchmark.
library(rbenchmark)
benchmark(f_long(t.df), f_ramnath(t.df), f_joran(t.df), f_tommy(t.df),
          f_tommy_mat(t.df), replications = 20, order = 'relative',
          columns = c('test', 'elapsed', 'relative'))
test elapsed relative
5 f_tommy_mat(t.df) 0.135 1.000000
2 f_ramnath(t.df) 0.172 1.274074
4 f_tommy(t.df) 0.311 2.303704
3 f_joran(t.df) 0.705 5.222222
1 f_long(t.df) 2.411 17.859259
Another option for when you do need mixed column types (and so you can't use matrix) is := in data.table. Example from ?":=" :
require(data.table)
m = matrix(1, nrow = 100000, ncol = 100)
DF = as.data.frame(m)
DT = as.data.table(m)
system.time(for (i in 1:1000) DF[i, 1] <- i)
# 591 seconds
system.time(for (i in 1:1000) DT[i, V1 := i])
# 1.16 seconds (509 times faster)
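Applied to the question's row-specific update, data.table's set() (a companion to := meant for use inside loops) keeps the per-row overhead tiny; a sketch of mine, not from the original answer:
# Sketch using data.table::set() on the question's data
library(data.table)
n <- 20000
DT <- as.data.table(matrix(1:5000, ncol = 10, nrow = n))
system.time(
  for (i in seq_len(n)) {
    j <- DT[[1]][i] %% 10 + 1  # the column to modify depends on the row's first value
    set(DT, i = i, j = as.integer(j), value = 99L)
  }
)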
