Pulling data after sorting the table - sorting

My problem is pulling the right observations from my data. My data is as below:
id term grade number
35 2005 I 0
35 2005 F 1
35 2005 W 2
46 2003 A 0
46 2003 B 1
46 2003 F 2
46 2003 I 3
I sorted the table and numbered the observations 0, 1, 2, and so on within each id; the data above shows the result after sorting. What I need are the ids whose grades start with I, F, and W, in that order, like id 35. So what I need from this table is the first three observations, those for id 35.

Here is one PROC SQL approach; you could also try a double DOW loop (2XDOW):
data have;
input (id term grade) (:$8.) number;
cards;
35 2005 I 0
35 2005 F 1
35 2005 W 2
46 2003 A 0
46 2003 B 1
46 2003 F 2
46 2003 I 3
;
proc sql;
   create table want as
   select * from have
   group by id
   having sum(grade='I' and number=0) > 0
      and sum(grade='F' and number=1) > 0
      and sum(grade='W' and number=2) > 0;
quit;
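For comparison, here is a minimal pandas sketch of the same group-level filter (not part of the original answer; the set-of-pairs test is just one way to express the HAVING conditions):
import pandas as pd

have = pd.DataFrame({
    "id":     ["35", "35", "35", "46", "46", "46", "46"],
    "term":   ["2005"] * 3 + ["2003"] * 4,
    "grade":  ["I", "F", "W", "A", "B", "F", "I"],
    "number": [0, 1, 2, 0, 1, 2, 3],
})

# Keep an id only if its first three grades are I, F, W (numbers 0, 1, 2).
def starts_with_ifw(g):
    return {("I", 0), ("F", 1), ("W", 2)} <= set(zip(g["grade"], g["number"]))

want = have.groupby("id").filter(starts_with_ifw)
print(want)  # the three observations for id 35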

Related

Creating a subset of data

I have a column called Project_Id which lists the names of many different projects, say Project A, Project B and so on. A second column lists the sales for each project.
A third column shows time series information. For example:
Project_ID Sales Time Series Information
A 10 1
A 25 2
A 31 3
A 59 4
B 22 1
B 38 2
B 76 3
C 82 1
C 23 2
C 83 3
C 12 4
C 90 5
D 14 1
D 62 2
From this dataset, I need to choose (and thus create a new dataset containing) only those projects which have at least 4 time series points. How do I get this using R code? The new dataset would be:
Project_ID Sales Time Series Information
A 10 1
A 25 2
A 31 3
A 59 4
C 82 1
C 23 2
C 83 3
C 12 4
C 90 5
Could someone please help?
Thanks a lot!
I tried to do some filtering with R but had little success.
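As a sketch of the underlying idea (keep only the projects whose groups have at least 4 rows), here it is in pandas, the language used elsewhere in this thread; the column names are taken from the example above:
import pandas as pd

df = pd.DataFrame({
    "Project_ID": list("AAAABBBCCCCCDD"),
    "Sales": [10, 25, 31, 59, 22, 38, 76, 82, 23, 83, 12, 90, 14, 62],
    "Time": [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 5, 1, 2],
})

# Keep only those projects that have at least 4 time series points.
subset = df.groupby("Project_ID").filter(lambda g: len(g) >= 4)
print(subset)  # projects A and C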

Calculate within correlation in panel data in long form

We have a simple panel data set in long form, which has the following structure:
i t x
1 Aug-2011 282
2 Aug-2011 -220
1 Sep-2011 334
2 Sep-2011 126
1 Sep-2012 -573
2 Sep-2012 305
1 Nov-2013 335
2 Nov-2013 205
3 Nov-2013 485
I would like to get the cross-correlation between each pair of i units within the time variable t.
This would be possible by converting the data to wide format. Unfortunately, that approach is not feasible due to the large number of i and t values in the real data set.
Is it possible to do something like in this fictional command:
by (tabulate t): corr x
You can easily calculate the correlations of a single variable such as x across panel groups using the reshape option of the community-contributed command pwcorrf:
ssc install pwcorrf
For illustration, consider (a slightly simplified version of) your toy example:
clear
input i t x
1 2011 282
2 2011 -220
1 2012 334
2 2012 126
1 2013 -573
2 2013 305
1 2014 335
2 2014 205
3 2014 485
end
xtset i t
panel variable: i (unbalanced)
time variable: t, 2011 to 2014
delta: 1 unit
pwcorrf x, reshape
Variable(s): x
Panel var: i
corrMatrix[3,3]
1 2 3
1 1 0 0
2 -.54223207 1 0
3 . . 1
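For what it's worth, the wide-format route the question rules out for Stata is short in pandas; a sketch on the same simplified toy data (unit 3 has a single observation, so its correlations are undefined, matching the dots above):
import pandas as pd

df = pd.DataFrame({
    "i": [1, 2, 1, 2, 1, 2, 1, 2, 3],
    "t": [2011, 2011, 2012, 2012, 2013, 2013, 2014, 2014, 2014],
    "x": [282, -220, 334, 126, -573, 305, 335, 205, 485],
})

# Reshape to wide (one column per panel unit), then correlate the columns.
wide = df.pivot(index="t", columns="i", values="x")
print(wide.corr())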

Group sequential numbers into min and max group pairs

I have a table with a list of numbers. Each number belongs to an entity.
Entity Number
1 1
1 2
1 3
1 4
...
1 20
2 21
2 22
2 23
1 24
2 25
2 26
2 30
2 31
2 32
2 33
The goal is to list the numbers, grouped by the entities as ranges (min-max pairs).
I need to find a way to group the above table as:
Entity Min Max
1 1 20
2 21 23
1 24 24
2 25 26
2 30 33
I successfully did this during my studies, but I always found it hard and can't remember how the algorithm worked.
This looks similar to SQL Data Range Min Max category
and TSQL Select Min & Max row when grouping
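The linked questions show SQL solutions; as a sketch of the underlying gaps-and-islands algorithm (here in pandas, on a condensed version of the example): start a new run whenever the entity changes or the numbers stop being consecutive, label the runs with a cumulative sum, then take the min and max per run.
import pandas as pd

df = pd.DataFrame({
    "Entity": [1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2],
    "Number": [1, 2, 3, 21, 22, 23, 24, 25, 26, 30, 31, 32, 33],
})

# A new run starts when the entity changes or the numbers are not consecutive.
new_run = (df["Entity"] != df["Entity"].shift()) | (df["Number"].diff() != 1)
run = new_run.cumsum()

ranges = df.groupby(run).agg(Entity=("Entity", "first"),
                             Min=("Number", "min"),
                             Max=("Number", "max")).reset_index(drop=True)
print(ranges)  # (1, 1-3), (2, 21-23), (1, 24-24), (2, 25-26), (2, 30-33)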

Stata overwrite all observations in cross section except last 20 non NA

I have a large strongly unbalanced panel in Stata, where each cross section only has a few observations, and the rest is NA (.).
I want to overwrite all non NA observations that are not among the last 20 non NA observations in each cross section. I'm not sure how to correctly specify the range. There are gaps between the observations.
Thanks
*Edit
I removed the code as it created uncertainty. It was included to show what I had tried.
My cross section dimension identifier is xsection
My time dimension identifier is id01
*Edit
I have created an example below. The code needs to extract the last 3 non NA (.) values of each cross section in variable x and enter these into a new variable z. Alternatively, all observations in x should be set to . except the last 3 (with allowed gaps). It does not matter whether a new variable z is created or the observations in x are replaced so that x looks like z.
id01 xsection x z
2005 1 20 .
2006 1 21 .
2007 1 22 .
2008 1 23 23
2009 1 37 37
2010 1 38 38
2011 1 . .
2012 1 . .
2005 2 24 .
2006 2 25 .
2007 2 21 .
2008 2 27 27
2009 2 33 33
2010 2 . .
2011 2 37 37
2012 2 . .
Note that NA is the jargon of some other programs, but not native to Stata. Stata calls these "missing values".
If you (1) segregate the observations with missing values, then (2) identifying the last so many observations with non-missing values follows immediately from sorting within the other observations, those with non-missing values.
. clear
. input id01 xsection x z
id01 xsection x z
1. 2005 1 20 .
2. 2006 1 21 .
3. 2007 1 22 .
4. 2008 1 23 23
5. 2009 1 37 37
6. 2010 1 38 38
7. 2011 1 . .
8. 2012 1 . .
9. 2005 2 24 .
10. 2006 2 25 .
11. 2007 2 21 .
12. 2008 2 27 27
13. 2009 2 33 33
14. 2010 2 . .
15. 2011 2 37 37
16. 2012 2 . .
17. end
. gen ismiss = missing(x)
. bysort ismiss xsection (id01) : gen z_last = z if _N - _n < 3
(10 missing values generated)
. sort id01 xsection
. assert z_last == z
Here z was supplied as what was wanted and z_last is calculated and shown to be equivalent.
This answer is a bit clunky, but it should get the job done. If x is the variable whose values you want to set to missing:
bysort xsection (id01): gen counter = sum(!missing(x))
by xsection: gen maxCount = counter[_N]
gen dropVar = maxCount - counter
replace x = . if dropVar >= 20
I am fairly sure that the equal sign should be included, but this would be easy to check.
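For comparison, a pandas sketch of the same keep-only-the-last-N-non-missing logic, using the toy example's N = 3 (not part of either Stata answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id01": list(range(2005, 2013)) * 2,
    "xsection": [1] * 8 + [2] * 8,
    "x": [20, 21, 22, 23, 37, 38, np.nan, np.nan,
          24, 25, 21, 27, 33, np.nan, 37, np.nan],
})

# Row labels of the last 3 non-missing x values within each cross section.
keep = (df.sort_values(["xsection", "id01"])
          .dropna(subset=["x"])
          .groupby("xsection")
          .tail(3)
          .index)

df["z"] = df["x"].where(df.index.isin(keep))  # everything else becomes NaN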

pandas groupby sort descending order

pandas groupby will by default sort. But I'd like to change the sort order. How can I do this?
I'm guessing that I can't apply a sort method to the returned groupby object.
Do your groupby, and use reset_index() to make it back into a DataFrame. Then sort.
grouped = df.groupby('mygroups').sum().reset_index()
grouped.sort_values('mygroups', ascending=False)
As of Pandas 0.18, one way to do this is to use the sort_index method of the grouped data.
Here's an example:
np.random.seed(1)
n=10
df = pd.DataFrame({'mygroups' : np.random.choice(['dogs','cats','cows','chickens'], size=n),
'data' : np.random.randint(1000, size=n)})
grouped = df.groupby('mygroups', sort=False).sum()
grouped = grouped.sort_index(ascending=False)
print(grouped)
data
mygroups
dogs 1831
chickens 1446
cats 933
As you can see, the groupby column is sorted descending now, instead of the default, which is ascending.
Similar to one of the answers above, but adding .sort_values() to your .groupby() chain will allow you to change the sort order. If you need to sort on a single column, it would look like this:
df.groupby('group')['id'].count().sort_values(ascending=False)
ascending=False will sort from high to low, the default is to sort from low to high.
*Careful with some of these aggregations. For example, .size() and .count() return different values, since .size() counts NaNs.
What is the difference between size and count in pandas?
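A quick illustration of that difference (a toy example of my own, not from the linked question):
import numpy as np
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1.0, np.nan, 2.0]})
print(df.groupby("g")["v"].size())   # counts all rows, NaN included: a -> 2, b -> 1
print(df.groupby("g")["v"].count())  # counts non-NaN values only:    a -> 1, b -> 1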
Another instance of preserving the order, or sorting in descending order:
In [97]: import pandas as pd
In [98]: df = pd.DataFrame({'name':['A','B','C','A','B','C','A','B','C'],'Year':[2003,2002,2001,2003,2002,2001,2003,2002,2001]})
#### Default groupby operation:
In [99]: for each in df.groupby(["Year"]): print(each)
(2001, Year name
2 2001 C
5 2001 C
8 2001 C)
(2002, Year name
1 2002 B
4 2002 B
7 2002 B)
(2003, Year name
0 2003 A
3 2003 A
6 2003 A)
### order preserved:
In [100]: for each in df.groupby(["Year"], sort=False): print(each)
(2003, Year name
0 2003 A
3 2003 A
6 2003 A)
(2002, Year name
1 2002 B
4 2002 B
7 2002 B)
(2001, Year name
2 2001 C
5 2001 C
8 2001 C)
In [106]: df.groupby(["Year"], sort=False).apply(lambda x: x.sort_values(["Year"]))
Out[106]:
Year name
Year
2003 0 2003 A
3 2003 A
6 2003 A
2002 1 2002 B
4 2002 B
7 2002 B
2001 2 2001 C
5 2001 C
8 2001 C
In [107]: df.groupby(["Year"], sort=False).apply(lambda x: x.sort_values(["Year"])).reset_index(drop=True)
Out[107]:
Year name
0 2003 A
1 2003 A
2 2003 A
3 2002 B
4 2002 B
5 2002 B
6 2001 C
7 2001 C
8 2001 C
You can do a sort_values() on the dataframe before you do the groupby. Pandas preserves the ordering in the groupby.
In [44]: d.head(10)
Out[44]:
name transcript exon
0 ENST00000456328 2 1
1 ENST00000450305 2 1
2 ENST00000450305 2 2
3 ENST00000450305 2 3
4 ENST00000456328 2 2
5 ENST00000450305 2 4
6 ENST00000450305 2 5
7 ENST00000456328 2 3
8 ENST00000450305 2 6
9 ENST00000488147 1 11
for _, a in d.head(10).sort_values(["transcript", "exon"]).groupby(["name", "transcript"]): print(a)
name transcript exon
1 ENST00000450305 2 1
2 ENST00000450305 2 2
3 ENST00000450305 2 3
5 ENST00000450305 2 4
6 ENST00000450305 2 5
8 ENST00000450305 2 6
name transcript exon
0 ENST00000456328 2 1
4 ENST00000456328 2 2
7 ENST00000456328 2 3
name transcript exon
9 ENST00000488147 1 11
This kind of operation is covered under hierarchical indexing. Check out the examples here
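For instance, after a multi-column groupby you get a MultiIndex, and sort_index can sort on any level of it (a small sketch of my own):
import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y", "y"], "b": [1, 2, 1, 2], "v": [3, 1, 4, 1]})

g = df.groupby(["a", "b"]).sum()                 # result has a hierarchical (Multi) index
print(g.sort_index(level="a", ascending=False))  # sort on the outer level only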
When you groupby, you're making new indices. If you also pass a list through .agg(), you'll get multiple columns. I was trying to figure this out and found this thread via Google.
It turns out you can pass sort_values a tuple identifying the exact column you want to sort on.
Try this:
# generate toy data
ex = pd.DataFrame(np.random.randint(1,10,size=(100,3)), columns=['features', 'AUC', 'recall'])
# pass a tuple corresponding to which specific col you want sorted. In this case, 'mean' or 'AUC' alone are not unique.
ex.groupby('features').agg(['mean','std']).sort_values(('AUC', 'mean'))
This will output a df sorted by the AUC-mean column only.
Use the by argument of sort_values.
A generic example, where 'Customer Name' and 'Profit' are columns:
df.groupby('Customer Name').Profit.agg(['count', 'min', 'max', 'mean']).sort_values(by=['count'], ascending=False)
