Frequency count of a column based on two other columns with data.table

Is there a smart way to solve the following problem using the data.table package instead of this dplyr code?
install.packages("dplyr")
library(dplyr)
data %>% group_by(Ticker, Year) %>% summarise(count = length(Value[!is.na(Value)]))

Do you mean this?
(Note: Sample data is based on data provided in your previous post here).
library(data.table)
setDT(df)[, .(count = sum(!is.na(Value))), by = .(RANDOM, Year)]
# RANDOM Year count
# 1: D 2010 2
# 2: C 2010 2
# 3: B 2008 5
# 4: D 2009 4
# 5: D 2008 4
# 6: A 2009 3
# 7: B 2009 5
# 8: C 2008 4
# 9: A 2008 8
#10: A 2010 2
#11: B 2010 1
#12: C 2009 8
Sample data
set.seed(2017)
RANDOM <- sample(c("A","B","C","D"), size = 100, replace = TRUE)
Year <- sample(c(2008,2009,2010), 100, TRUE)
Value <- sample(c(0.22, NA), 100, TRUE)
df <- data.frame(RANDOM, Year, Value)
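If you also want the result ordered by the grouping columns, keyby does the grouping and the sorting in one step; a small variation on the answer above, not from the original post:
setDT(df)[, .(count = sum(!is.na(Value))), keyby = .(RANDOM, Year)]
As a side effect, keyby sets RANDOM and Year as the key of the result, which speeds up subsequent joins and lookups on those columns.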

Related

Generate rows to make a sequence in a column of a dataframe

I'm trying to generate new rows based on values in a certain column. As you can see in the current data, the 'days_left' column does not contain all sequential values.
current = {'assignment': [1,1,1,1,2,2,2,2,2], 'days_left': [1, 2, 5, 9,1, 3, 4, 8, 13]}
dfcurrent = pd.DataFrame(data=current)
dfcurrent
I want to generate rows in that dataframe to make a sequential list of 'days_left' values for each 'assignment'. Please see the desired output below:
desired = {'assignment': [1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2],
'days_left': [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,10,11,12,13]}
dfdesired = pd.DataFrame(data=desired)
dfdesired
Note: The original data is much bigger and has other columns as well but I just simplified it for this question.
Could you please help me solve this?
Thank you very much in advance!
You can iterate through the rows of the current dataframe and create a new dataframe. For each days_left range, copy the current row to the new dataframe and update the days_left column value.
Try this code:
import pandas as pd

current = {'assignment': [1,1,1,1,2,2,2,2,2], 'days_left': [1, 2, 5, 9, 1, 3, 4, 8, 13]}
dfc = pd.DataFrame(data=current)
dfd = pd.DataFrame()  # new dataframe

for r in range(1, len(dfc)):  # start at 2nd row
    # fill the gap of missing numbers before row r
    for i in range(dfc.iloc[r-1]['days_left'], dfc.iloc[r]['days_left']):
        dfd = pd.concat([dfd, dfc.iloc[[r]]], ignore_index=True)  # copy row
        dfd.loc[len(dfd)-1, 'days_left'] = i  # update column value
    # last entry of an assignment: keep row r itself
    if r == len(dfc)-1 or dfc.iloc[r+1]['assignment'] != dfc.iloc[r]['assignment']:
        dfd = pd.concat([dfd, dfc.iloc[[r]]], ignore_index=True)  # copy row

dfd = dfd.astype(int)  # convert all data to integers
print(dfd.to_string(index=False))
Output
assignment days_left
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
2 1
2 2
2 3
2 4
2 5
2 6
2 7
2 8
2 9
2 10
2 11
2 12
2 13
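As an alternative to the row-by-row loop, the same fill can be expressed with a groupby plus a per-group range. A minimal sketch, assuming (as the loop above does) that each assignment should run from its smallest to its largest days_left:
import pandas as pd

dfc = pd.DataFrame({'assignment': [1,1,1,1,2,2,2,2,2],
                    'days_left': [1, 2, 5, 9, 1, 3, 4, 8, 13]})

# build the full min..max sequence per assignment, then flatten back to rows
dfd = (dfc.groupby('assignment')['days_left']
          .apply(lambda s: pd.Series(range(s.min(), s.max() + 1)))
          .rename('days_left')
          .reset_index(level='assignment')
          .reset_index(drop=True))
print(dfd.to_string(index=False))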

Randomization of treatments in RStudio

I want to get a randomization of treatments with three levels and sample size n = 15. I'm stuck here:
volunteers <- 1:15
set.seed(1); sample(volunteers, size=5, replace=F)
I want three different groups of five each, but I'm new to R. This is a data setup for ANOVA, not a question about a specific data set. Also, I don't understand what set.seed does.
I think you are looking for something like this:
set.seed(1337)
# replace with your real participant ids
volunteers <- 1:15
# set the number of groups
number.of.groups <- 1:3
# set group size
group.size <- 5
# generate data frame with participant > group order
df <- data.frame(group = sort(rep(number.of.groups, group.size)),
                 participant = sample(volunteers, length(volunteers)))
# show your groups
df[which(df$group==1),]
# group participant
# 1 1 9
# 2 1 8
# 3 1 1
# 4 1 6
# 5 1 5
df[which(df$group==2),]
# group participant
# 6 2 4
# 7 2 15
# 8 2 3
# 9 2 2
# 10 2 13
df[which(df$group==3),]
# group participant
# 11 3 11
# 12 3 10
# 13 3 14
# 14 3 12
# 15 3 7
You only need to use set.seed() if you want to be able to replicate your samples, since it makes you draw the same "random" samples every time. Consequently, set.seed() is more for testing than for real analysis code. Which seed you set is, by the way, irrelevant; if you want to replicate, just make sure to always set the same seed.
How about:
install.packages("randomizr")
library(randomizr)
Z <- complete_ra(15, num_arms = 3)
table(Z)
This gives
> table(Z)
Z
T1 T2 T3
5 5 5
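For completeness, the same complete randomization can also be done in base R without an extra package; a minimal sketch:
set.seed(1337)
volunteers <- 1:15
# shuffle the participant ids, then cut them into three groups of five
groups <- split(sample(volunteers), rep(1:3, each = 5))
groups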

Setting With Enlargement in daru

Is there some way to do setting-with-enlargement in daru, something similar to loc in pandas?
Yes, you can.
For Daru::Vector objects, use the #push method like so:
require 'daru'
v = Daru::Vector.new([1,2,3], index: [:a,:b,:c])
v.push(23, :r)
v
#=>
#<Daru::Vector:74005360 #name = nil #size = 4 >
# nil
# a 1
# b 2
# c 3
# r 23
For setting a new vector in Daru::DataFrame, call the #[]= method with your new name inside the []. You can either assign a Daru::Vector or an Array.
If you assign Daru::Vector, the data will be aligned so that the indexes of the DataFrame and Vector match.
For example,
require 'daru'
df = Daru::DataFrame.new({a: [1,2,3], b: [5,6,7]})
df[:r] = [11,22,33]
df
# =>
#<Daru::DataFrame:73956870 #name = c8a65ffe-217d-43bb-b6f8-50d2530ec053 #size = 3>
# a b r
# 0 1 5 11
# 1 2 6 22
# 2 3 7 33
You assign a row with the DataFrame#row[]= method. For example, using the previous dataframe df:
df.row[:a] = [23,35,2]
df
#=>
#<Daru::DataFrame:73956870 #name = c8a65ffe-217d-43bb-b6f8-50d2530ec053 #size = 4>
# a b r
# 0 1 5 11
# 1 2 6 22
# 2 3 7 33
# a 23 35 2
Assigning a Daru::Vector will align according to the names of the vectors of the Daru::DataFrame.
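For example, a small sketch of that alignment behavior (the vector's index order here is deliberately scrambled to show the effect):
require 'daru'
df = Daru::DataFrame.new({a: [1,2,3], b: [5,6,7]})
# values are matched to the frame by index, not by position:
# row 0 gets 11, row 1 gets 22, row 2 gets 33
df[:r] = Daru::Vector.new([33, 11, 22], index: [2, 0, 1])
df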
You can see further details in these notebooks.
Hope this answers your question.

Take out elements from a vector that meet a certain condition

I have two vectors, A = [1,3,5] and B = [1,2,3,4,5,6,7,8,9,10]. I want to get C = [2,4,6,7,8,9,10] by extracting the elements of B that A doesn't have.
I don't want to use loops, because this is a simplified problem from a real data simulation. In the real case A and B are huge, but A is included in B.
Here are two methods:
C=setdiff(B,A)
but if values are repeated in B they will only come up once in C, or
C=B(~ismember(B,A))
which will preserve repeated values in B.
One approach with unique, sort and diff -
C = [A B];
[~,~,idC] = unique(C);
[sidC,id_idC] = sort(idC);
start_id = id_idC(diff([0 sidC])==1);
out = C(start_id(start_id>numel(A)))
Sample runs -
Case #1 (Sample from question):
A =
1 3 5
B =
1 2 3 4 5 6 7 8 9 10
out =
2 4 6 7 8 9 10
Case #2 (Bit more generic case):
A =
11 15 14
B =
19 14 6 8 9 11 15
out =
6 8 9 19

pandas groupby sort descending order

By default, pandas groupby sorts the group keys in ascending order, but I'd like to change the sort order. How can I do this?
I'm guessing that I can't apply a sort method to the returned groupby object.
Do your groupby, and use reset_index() to make it back into a DataFrame. Then sort.
grouped = df.groupby('mygroups').sum().reset_index()
grouped.sort_values('mygroups', ascending=False)
As of Pandas 0.18, one way to do this is to use the sort_index method of the grouped data.
Here's an example:
import numpy as np
import pandas as pd

np.random.seed(1)
n = 10
df = pd.DataFrame({'mygroups': np.random.choice(['dogs','cats','cows','chickens'], size=n),
                   'data': np.random.randint(1000, size=n)})
grouped = df.groupby('mygroups', sort=False).sum()
grouped = grouped.sort_index(ascending=False)
print(grouped)
data
mygroups
dogs 1831
chickens 1446
cats 933
As you can see, the groupby column is now sorted descending, instead of the default, which is ascending.
Similar to one of the answers above, but adding .sort_values() to your .groupby() result allows you to change the sort order. If you need to sort on a single column, it would look like this:
df.groupby('group')['id'].count().sort_values(ascending=False)
ascending=False will sort from high to low, the default is to sort from low to high.
*Careful with some of these aggregations. For example .size() and .count() return different values since .size() counts NaNs.
What is the difference between size and count in pandas?
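A quick illustration of that difference; a minimal example, not from the original answer:
import numpy as np
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, np.nan, 2.0]})
print(df.groupby('g')['v'].size())   # rows per group, NaN included: a 2, b 1
print(df.groupby('g')['v'].count())  # non-NaN values per group:     a 1, b 1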
Another instance of preserving the order, or sorting in descending order:
In [97]: import pandas as pd
In [98]: df = pd.DataFrame({'name':['A','B','C','A','B','C','A','B','C'],'Year':[2003,2002,2001,2003,2002,2001,2003,2002,2001]})
#### Default groupby operation:
In [99]: for each in df.groupby(["Year"]): print(each)
(2001, Year name
2 2001 C
5 2001 C
8 2001 C)
(2002, Year name
1 2002 B
4 2002 B
7 2002 B)
(2003, Year name
0 2003 A
3 2003 A
6 2003 A)
### order preserved:
In [100]: for each in df.groupby(["Year"], sort=False): print(each)
(2003, Year name
0 2003 A
3 2003 A
6 2003 A)
(2002, Year name
1 2002 B
4 2002 B
7 2002 B)
(2001, Year name
2 2001 C
5 2001 C
8 2001 C)
In [106]: df.groupby(["Year"], sort=False).apply(lambda x: x.sort_values(["Year"]))
Out[106]:
Year name
Year
2003 0 2003 A
3 2003 A
6 2003 A
2002 1 2002 B
4 2002 B
7 2002 B
2001 2 2001 C
5 2001 C
8 2001 C
In [107]: df.groupby(["Year"], sort=False).apply(lambda x: x.sort_values(["Year"])).reset_index(drop=True)
Out[107]:
Year name
0 2003 A
1 2003 A
2 2003 A
3 2002 B
4 2002 B
5 2002 B
6 2001 C
7 2001 C
8 2001 C
You can do a sort_values() on the dataframe before you do the groupby. Pandas preserves the row ordering within each group.
In [44]: d.head(10)
Out[44]:
name transcript exon
0 ENST00000456328 2 1
1 ENST00000450305 2 1
2 ENST00000450305 2 2
3 ENST00000450305 2 3
4 ENST00000456328 2 2
5 ENST00000450305 2 4
6 ENST00000450305 2 5
7 ENST00000456328 2 3
8 ENST00000450305 2 6
9 ENST00000488147 1 11
for _, a in d.head(10).sort_values(["transcript", "exon"]).groupby(["name", "transcript"]): print(a)
name transcript exon
1 ENST00000450305 2 1
2 ENST00000450305 2 2
3 ENST00000450305 2 3
5 ENST00000450305 2 4
6 ENST00000450305 2 5
8 ENST00000450305 2 6
name transcript exon
0 ENST00000456328 2 1
4 ENST00000456328 2 2
7 ENST00000456328 2 3
name transcript exon
9 ENST00000488147 1 11
This kind of operation is covered under hierarchical indexing. Check out the examples here.
When you groupby, you're making new indices. If you also pass a list through .agg(), you'll get multiple columns. I was trying to figure this out and found this thread via Google. It turns out that if you pass a tuple naming the exact column, you can sort on it.
Try this:
import numpy as np
import pandas as pd

# generate toy data
ex = pd.DataFrame(np.random.randint(1, 10, size=(100, 3)), columns=['features', 'AUC', 'recall'])
# pass a tuple naming the specific column to sort on;
# 'mean' or 'AUC' alone are not unique here
ex.groupby('features').agg(['mean','std']).sort_values(('AUC', 'mean'))
This will output a df sorted by the AUC-mean column only.
Use the 'by' argument in the 'sort_values' clause. A generic example, where 'Customer Name' and 'Profit' are columns:
df.groupby('Customer Name').Profit.agg(['count', 'min', 'max', 'mean']).sort_values(by=['count'], ascending=False)
