pandas groupby will by default sort. But I'd like to change the sort order. How can I do this?
I'm guessing that I can't apply a sort method to the returned groupby object.
Do your groupby, and use reset_index() to make it back into a DataFrame. Then sort.
grouped = df.groupby('mygroups').sum().reset_index()
grouped.sort_values('mygroups', ascending=False)
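A minimal, self-contained sketch of this approach. The toy values below are made up for illustration; only the column names `mygroups` and `data` follow the examples in this thread.

```python
import pandas as pd

# Toy data; the values are illustrative assumptions.
df = pd.DataFrame({
    "mygroups": ["dogs", "cats", "cows", "cats", "dogs"],
    "data": [1, 2, 3, 4, 5],
})

# Aggregate, turn the group keys back into a regular column, then sort.
grouped = df.groupby("mygroups").sum().reset_index()
result = grouped.sort_values("mygroups", ascending=False)
print(result)
```

Note that sort_values returns a new DataFrame, so the result has to be assigned (or displayed) to be of any use.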
As of Pandas 0.18 one way to do this is to use the sort_index method of the grouped data.
Here's an example:
import numpy as np
import pandas as pd

np.random.seed(1)
n = 10
df = pd.DataFrame({'mygroups': np.random.choice(['dogs', 'cats', 'cows', 'chickens'], size=n),
                   'data': np.random.randint(1000, size=n)})
grouped = df.groupby('mygroups', sort=False).sum()
# sort_index returns a new DataFrame, so assign the result
grouped = grouped.sort_index(ascending=False)
print(grouped)
          data
mygroups
dogs      1831
chickens  1446
cats       933
As you can see, the groupby column is now sorted in descending order, instead of the default, which is ascending.
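One detail worth checking when using this approach: sort_index does not modify the grouped result in place, it returns a new object. A small sketch with made-up data (the values are assumptions for illustration):

```python
import pandas as pd

# Toy data; the values are illustrative assumptions.
df = pd.DataFrame({
    "mygroups": ["dogs", "cats", "chickens", "dogs", "cats"],
    "data": [10, 20, 30, 40, 50],
})

# sort=False keeps the groups in order of first appearance.
grouped = df.groupby("mygroups", sort=False).sum()

# sort_index returns a new DataFrame; `grouped` itself is unchanged.
descending = grouped.sort_index(ascending=False)

print(list(grouped.index))
print(list(descending.index))
```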
Similar to one of the answers above: chaining .sort_values() after the .groupby() aggregation lets you change the sort order. If you need to sort on a single column, it would look like this:
df.groupby('group')['id'].count().sort_values(ascending=False)
ascending=False sorts from high to low; the default sorts from low to high.
*Careful with some of these aggregations. For example, .size() and .count() can return different values, since .size() counts NaNs and .count() does not.
What is the difference between size and count in pandas?
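To make the size/count caveat concrete, here is a small sketch with made-up data (the column names `group` and `id` follow the snippet above; the values are assumptions):

```python
import numpy as np
import pandas as pd

# Toy data with missing values; illustrative only.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "id": [1.0, np.nan, 3.0, np.nan, np.nan],
})

sizes = df.groupby("group")["id"].size()    # counts every row, NaN included
counts = df.groupby("group")["id"].count()  # counts only non-NaN values

# Sorting the aggregated Series from high to low.
ranked = sizes.sort_values(ascending=False)
print(sizes, counts, ranked, sep="\n")
```

Here the two aggregations disagree for both groups, so the sort order of the result can depend on which one you pick.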
Another example, showing how to preserve the original order or sort in descending order:
In [97]: import pandas as pd
In [98]: df = pd.DataFrame({'name':['A','B','C','A','B','C','A','B','C'],'Year':[2003,2002,2001,2003,2002,2001,2003,2002,2001]})
#### Default groupby operation:
In [99]: for each in df.groupby(["Year"]): print(each)
(2001, Year name
2 2001 C
5 2001 C
8 2001 C)
(2002, Year name
1 2002 B
4 2002 B
7 2002 B)
(2003, Year name
0 2003 A
3 2003 A
6 2003 A)
### order preserved:
In [100]: for each in df.groupby(["Year"], sort=False): print(each)
(2003, Year name
0 2003 A
3 2003 A
6 2003 A)
(2002, Year name
1 2002 B
4 2002 B
7 2002 B)
(2001, Year name
2 2001 C
5 2001 C
8 2001 C)
In [106]: df.groupby(["Year"], sort=False).apply(lambda x: x.sort_values(["Year"]))
Out[106]:
Year name
Year
2003 0 2003 A
3 2003 A
6 2003 A
2002 1 2002 B
4 2002 B
7 2002 B
2001 2 2001 C
5 2001 C
8 2001 C
In [107]: df.groupby(["Year"], sort=False).apply(lambda x: x.sort_values(["Year"])).reset_index(drop=True)
Out[107]:
Year name
0 2003 A
1 2003 A
2 2003 A
3 2002 B
4 2002 B
5 2002 B
6 2001 C
7 2001 C
8 2001 C
You can call sort_values() on the DataFrame before you do the groupby. Pandas preserves the order of the rows within each group.
In [44]: d.head(10)
Out[44]:
name transcript exon
0 ENST00000456328 2 1
1 ENST00000450305 2 1
2 ENST00000450305 2 2
3 ENST00000450305 2 3
4 ENST00000456328 2 2
5 ENST00000450305 2 4
6 ENST00000450305 2 5
7 ENST00000456328 2 3
8 ENST00000450305 2 6
9 ENST00000488147 1 11
for _, a in d.head(10).sort_values(["transcript", "exon"]).groupby(["name", "transcript"]): print(a)
name transcript exon
1 ENST00000450305 2 1
2 ENST00000450305 2 2
3 ENST00000450305 2 3
5 ENST00000450305 2 4
6 ENST00000450305 2 5
8 ENST00000450305 2 6
name transcript exon
0 ENST00000456328 2 1
4 ENST00000456328 2 2
7 ENST00000456328 2 3
name transcript exon
9 ENST00000488147 1 11
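The same idea in a compact, checkable form. This is only a sketch with a made-up frame (the `name` and `exon` columns echo the example above, the values are assumptions):

```python
import pandas as pd

# Toy data; illustrative only.
d = pd.DataFrame({
    "name": ["x", "y", "x", "y", "x"],
    "exon": [3, 2, 1, 1, 2],
})

# Sort the rows first; groupby keeps that row order inside each group.
ordered = {
    name: list(group["exon"])
    for name, group in d.sort_values("exon").groupby("name")
}
print(ordered)
```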
This kind of operation is covered under hierarchical indexing. Check out the examples here
When you groupby, you're making new indices. If you also pass a list to .agg(), you'll get multiple columns. I was trying to figure this out and found this thread via Google.
It turns out you can pass a tuple that identifies the exact column you want to sort on.
Try this:
# generate toy data
ex = pd.DataFrame(np.random.randint(1,10,size=(100,3)), columns=['features', 'AUC', 'recall'])
# pass a tuple corresponding to which specific col you want sorted. In this case, 'mean' or 'AUC' alone are not unique.
ex.groupby('features').agg(['mean','std']).sort_values(('AUC', 'mean'))
This will output a df sorted by the AUC-mean column only.
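For a deterministic version of the same idea, here is a sketch with fixed, made-up numbers instead of random ones, so the resulting order can be checked by hand:

```python
import pandas as pd

# Fixed toy data; the values are illustrative assumptions.
ex = pd.DataFrame({
    "features": [1, 1, 2, 2, 3, 3],
    "AUC": [0.9, 0.7, 0.5, 0.6, 0.95, 0.85],
    "recall": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# Passing a list to agg() produces two-level column labels such as ('AUC', 'mean').
summary = ex.groupby("features").agg(["mean", "std"])

# The tuple selects exactly one of those columns to sort by.
by_auc_mean = summary.sort_values(("AUC", "mean"))
print(by_auc_mean)
```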
Use the 'by' argument of sort_values.
A generic example, where 'Customer Name' and 'Profit' are columns:
df.groupby('Customer Name').Profit.agg(['count', 'min', 'max',
'mean']).sort_values(by = ['count'], ascending=False)
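A runnable sketch of that pattern, assuming a small made-up frame (the customer names and profit figures are invented for illustration):

```python
import pandas as pd

# Toy data; names and values are illustrative assumptions.
df = pd.DataFrame({
    "Customer Name": ["Ann", "Bob", "Ann", "Cy", "Ann", "Cy"],
    "Profit": [10.0, 5.0, 20.0, 7.0, 30.0, 3.0],
})

# Several aggregates per customer, sorted by the 'count' column, largest first.
summary = (
    df.groupby("Customer Name")["Profit"]
      .agg(["count", "min", "max", "mean"])
      .sort_values(by=["count"], ascending=False)
)
print(summary)
```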
I have the following data:
a b c d
5 9 6 0
3 1 3 2
Characters in the first row, numbers in the second row.
How do I get the character corresponding to the highest number in the second row, and how do I increase the corresponding number in the second row? (For example, here, column b has the highest number, 9, so increase that number by 10%.)
I use Dyalog version 17.1.
With:
⎕←data←3 4⍴'a' 'b' 'c' 'd' 5 9 6 0 3 1 3 2
a b c d
5 9 6 0
3 1 3 2
You can extract the second row with:
2⌷data
5 9 6 0
Now grade it descending, that is, find the indices that would sort it from highest to lowest:
⍒2⌷data
2 3 1 4
The first number is the column we're looking for:
⊃⍒2⌷data
2
Now we can use this to extract the character from the first row:
data[⊂1,⊃⍒2⌷data]
b
But we only need the column index, not the actual character. The full index of the number we want to increase is:
2,⊃⍒2⌷data
2 2
Extracting the data to see that we got the right index:
data[⊂2,⊃⍒2⌷data]
9
Now we can either create a new array with the target value increased by 10%:
1.1×@(⊂2,⊃⍒2⌷data)⊢data
a b c d
5 9.9 6 0
3 1 3 2
Or change it in-place:
data[⊂2,⊃⍒2⌷data]×←1.1
data
a b c d
5 9.9 6 0
3 1 3 2
Hi, I am trying to get some data displayed using the FILTER function in Google Sheets.
What I want is the minimum value across 3 columns in one row.
Is this possible?
For example:
A 1 6 10
B 3 5 9
C 4 4 8
D 5 3 7
A 2 1 6
Filter on A should give:
A 1
A 1
Filter on B should give:
B 3
I would really like to use filter function but =filter({A:A,min(B:D)},A:A="A") doesn't work.
Maybe, if your three (labelled) columns are A, B and C:
=filter(A2:C2,A2:C2=min(A2:C2))
but in that case filter would be overkill.
Given the following data frame:
df = pd.DataFrame({'A' : ['1','2','3','7'],
'B' : [7,6,5,4],
'C' : [5,6,7,1],
'D' : [1,9,9,8]})
df=df.set_index('A')
df
B C D
A
1 7 5 1
2 6 6 9
3 5 7 9
7 4 1 8
I want to sort the columns in descending order by the values in the bottom row, like this:
D B C
A
1 1 7 5
2 9 6 6
3 9 5 7
7 8 4 1
Thanks in advance!
Easiest way is to take the transpose, then sort_values, then transpose back.
df.T.sort_values('7', ascending=False).T
or
df.T.sort_values(df.index[-1], ascending=False).T
Gives:
D B C
A
1 1 7 5
2 9 6 6
3 9 5 7
7 8 4 1
Testing
my solution
%%timeit
df.T.sort_values(df.index[-1], ascending=False).T
1000 loops, best of 3: 444 µs per loop
alternative solution
%%timeit
df[[c for c in sorted(list(df.columns), key=df.iloc[-1].get, reverse=True)]]
1000 loops, best of 3: 525 µs per loop
You can use sort_values (by the index position of your row) with axis=1:
df.sort_values(by=df.index[-1], axis=1, ascending=False, inplace=True)
Here is a variation that does not involve transposition:
df = df[[c for c in sorted(list(df.columns), key=df.iloc[-1].get, reverse=True)]]
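The three approaches from this thread can be compared side by side on the frame from the question. This is a sketch; the variable names `via_transpose`, `via_axis`, and `via_columns` are made up here, and ascending=False is passed wherever the method sorts ascending by default so that the result matches the requested descending order.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "A": ["1", "2", "3", "7"],
    "B": [7, 6, 5, 4],
    "C": [5, 6, 7, 1],
    "D": [1, 9, 9, 8],
}).set_index("A")

# 1. Transpose, sort rows by the former last row, transpose back.
via_transpose = df.T.sort_values(df.index[-1], ascending=False).T

# 2. Sort the columns directly by the last row's label.
via_axis = df.sort_values(by=df.index[-1], axis=1, ascending=False)

# 3. Reorder the column labels by their value in the last row.
via_columns = df[sorted(df.columns, key=df.iloc[-1].get, reverse=True)]

print(via_axis)
```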
This is a question that could help me solve another, still unsolved question I posted. Basically I need to condition a dataset in Stata, and I thought of a procedure that would first store certain values of a variable in a sort of matrix and then compare the values of another variable with those stored in the matrix. A simple example could be the following:
obs id act1 act2 year act1year
1 1 0 1 2000 0
2 1 1 0 2001 2001
3 1 0 1 2004 0
4 2 1 0 2001 2001
5 2 1 0 2002 2002
6 2 0 1 2004 0
The code should be able to save in the matrix, by(id), the values of act1year different from 0 (in this case 2001 for group 1). It should then check, for the observations for which act2 is 1 (obs i=1,3), whether this value is included in the range [year(i) : year(i)-2]. In this case the range does not contain the value stored in the matrix, so the observation will be dropped. For group id 2 the code should store [2001, 2002] and then check whether the range [year(6) : year(6)-2] contains any of the values stored in the matrix.
I hope my question is clear enough! Apologies for not posting any attempt but this is something I really have no idea about how to do.
Both this question and the previous discussion are difficult for me to understand, so let me suggest the following as a starting point to a solution that identifies observations for which either (a) act1 occurs or (b) act2 occurs no more than 2 years after the most recent act1 occurrence.
clear
input id act1 act2 year
1 0 1 2000
1 1 0 2001
1 0 1 2004
2 1 0 2001
2 1 0 2002
2 0 1 2004
end
generate a1yr = 0
replace a1yr = year if act1==1
generate act1r = -act1
bysort id (year act1r): replace a1yr=a1yr[_n-1] if a1yr==0 & _n>1
generate tokeep = 0
replace tokeep = 1 if act1==1
replace tokeep = 1 if act2==1 & year-a1yr<=2
list, clean noobs
Looking at the previous discussion, as it now stands, suggests substituting the following data into the code above and seeing if the code then meets the needs of that discussion.
input obsno id act1 act2 year
1 1 1 0 2000
2 1 0 1 2001
3 1 0 1 2002
4 1 0 1 2002
5 1 0 1 2003
6 2 1 0 2000
7 2 1 0 2001
8 2 0 1 2002
9 2 0 1 2002
10 2 0 1 2003
end
I have a question about using SAS to restructure a dataset. This is my old dataset:
question answer
1 3
2 4
3 5
4 3
5 1
1 2
2 4
3 1
4 3
5 6
The ideal output dataset is
ques1 ques2 ques3 ques4 ques5
3 4 5 3 1
2 4 1 3 6
The solution is simple. Create a dummy column that stores the group each block of questions belongs to, then transpose the data using that group as the BY variable, which produces 2 separate output rows. Check out the following code.
data have;
infile datalines missover;
input question answer ;
if question=1 then group+1;
datalines;
1 3
2 4
3 5
4 3
5 1
1 2
2 4
3 1
4 3
5 6
;;;;
run;
proc transpose data=have out=want prefix=ques;
by group;
var answer;
id question;
run;
proc print data=want;run;