Which function/algorithm for this merging and filling operation? - algorithm

I have written R code that merges two data frames on their first column and, where a value is missing, fills it in with the value from the row above. Here is what it does:
Two input data frames:
1 a
2 b
3 c
5 d
And
1 e
4 f
6 g
My code gives this output:
1 a e
2 b e
3 c e
4 c f
5 d f
6 d g
My code is inefficient, however, because it is not properly vectorized. Are there R functions I could use? Basically, I am looking for a function that fills in missing (NA) values by replacing each NA with the value of the previous element.
I looked through the R reference manual but could not find anything.

Here is a solution that uses zoo::na.locf ("last observation carried forward"):
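library(zoo)
# The two input data frames, keyed by id
a <- data.frame(id=c(1,2,3,5), v=c("a","b","c", "d"))
b <- data.frame(id=c(1,4,6), v=c("e", "f", "g"))
# Expand both to the full id range 1..n; ids without a match get NA
n <- max(c(a$id, b$id))
an <- merge(data.frame(id=1:n), a, all.x=TRUE)
bn <- merge(data.frame(id=1:n), b, all.x=TRUE)
# Replace each NA with the last non-NA value above it.
# Both frames contain id 1, so there is no leading NA to worry about.
an$v <- na.locf(an$v)
bn$v <- na.locf(bn$v)
data.frame(an$id, an$v, bn$v)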
an.id an.v bn.v
1 1 a e
2 2 b e
3 3 c e
4 4 c f
5 5 d f
6 6 d g
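For readers who want to see the underlying algorithm rather than the zoo call, here is a minimal sketch in plain Python. It is not the answer's code: the function name locf_merge and the use of dicts keyed by id are assumptions made for illustration, and it walks the union of the ids instead of the full range 1..n (the two coincide for this data).

```python
def locf_merge(a, b):
    """Merge two id -> value mappings, carrying the last seen value forward.

    Hypothetical helper for illustration only.
    """
    rows = []
    last_a = last_b = None  # stays None until the first value is seen
    for key in sorted(set(a) | set(b)):
        # Use the value for this id if present, otherwise keep the previous one
        last_a = a.get(key, last_a)
        last_b = b.get(key, last_b)
        rows.append((key, last_a, last_b))
    return rows


a = {1: "a", 2: "b", 3: "c", 5: "d"}
b = {1: "e", 4: "f", 6: "g"}
rows = locf_merge(a, b)
for row in rows:
    print(*row)
```

This is a single pass over the sorted ids, so it does the same work as the merge plus na.locf steps without building the intermediate frames.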

Related

destructure sequence into lexical variables

I have a sequence with a known number of elements (from a pcre match) and would like to map this into lexical variables.
I can probably loop over the sequence, put every element onto the stack, and then use :> ( a b c d ), but is there an idiomatic way to do this?
Also, my sequence has more than 4 elements, so first4 doesn't cut it, although I could obviously use first4 and then first3 on a subset of the sequence.
If you are sure that's what you really want to do, you could use firstn from sequences.generalizations:
SYMBOLS: a b c d e f g h ;
[let
{ 1 2 3 4 5 6 7 8 }
8 firstn :> ( a b c d e f g h )
a b c d e f g h . . . . . . . . ]
But it sounds like a bad idea. It's tricky, because the lexical variables are not "real" variables, the compiler converts them into stack shuffling. That's why they don't play nice with macros and :> can't be called like a regular word.
If you use dynamic variables it's easier:
SYMBOLS: a b c d e f g h ;
{ 1 2 3 4 5 6 7 8 }
{ a b c d e f g h } [ set ] 2each
{ a b c d e f g h } [ get . ] each

sorting a dataframe by values and storing index and columns

I have a pandas DataFrame which is actually a matrix. It looks as shown below
a b c
d 1 0 5
e 0 6 2
f 2 0 3
I need the values sorted (largest first), together with the index and column label of each value. The result should be:
index Column Value
e b 6
d c 5
f c 3
Use stack to reshape the frame into a Series indexed by (index, column) pairs, then nlargest to pick the top values:
df1 = df.stack().nlargest(3).rename_axis(['idx','col']).reset_index(name='val')
print (df1)
idx col val
0 e b 6
1 d c 5
2 f c 3
To keep the labels as a MultiIndex instead:
df2 = df.stack().nlargest(3).to_frame(name='val')
print (df2)
val
e b 6
d c 5
f c 3
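The answer does not show how df was built, so here is a self-contained sketch that reconstructs the frame from the question and runs the first one-liner step by step. The intermediate names stacked and top3 are introduced here only for readability.

```python
import pandas as pd

# The matrix from the question: rows d, e, f and columns a, b, c
df = pd.DataFrame(
    {"a": [1, 0, 2], "b": [0, 6, 0], "c": [5, 2, 3]},
    index=["d", "e", "f"],
)

# stack() turns the frame into a Series keyed by (row label, column label)
stacked = df.stack()

# nlargest(3) keeps the three biggest values, already sorted descending
top3 = stacked.nlargest(3)

# Name the two index levels and move them into ordinary columns
df1 = top3.rename_axis(["idx", "col"]).reset_index(name="val")
print(df1)
```

To list every value rather than the top three, stacked.sort_values(ascending=False) could be used in place of nlargest.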

Pandas pivot table Nested Sorting Part 3

Episode 3:
In part 2, we retained the hierarchical nature of the indices while sorting within right-most level. In part 1, we applied a custom sort to the left-most index level while sorting the values within the right-most index.
Now, I'd like to combine both methods.
Given the following data frame and resultant pivot table:
import pandas as pd
df=pd.DataFrame({'A':['a','a','a','a','a','b','b','b','b'],
'B':['x','y','z','x','y','z','x','y','z'],
'C':['a','b','a','b','a','b','a','b','a'],
'D':[7,5,3,4,1,6,5,3,1]})
df
A B C D
0 a x a 7
1 a y b 5
2 a z a 3
3 a x b 4
4 a y a 1
5 b z b 6
6 b x a 5
7 b y b 3
8 b z a 1
table = pd.pivot_table(df, index=['A', 'B','C'],aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I would like to specify a custom order of 'B'.
This seems to work:
df['B']=df['B'].astype('category')
df['B'] = df['B'].cat.set_categories(['z','x','y'])
Next, I'd like the pivot table to keep the order of 'B' specified above while sorting the values of 'D' in descending order within each category of 'B'.
Like this:
       D
A B C
a z a  3
  x a  7
    b  4
  y b  5
    a  1
b z b  6
    a  1
  x a  5
  y b  3
Thanks in advance!
UPDATE: using pivot_table()
In [79]: df.pivot_table(index=['A','B','C'], aggfunc='sum').reset_index().sort_values(['A','B','D'], ascending=[1,1,0]).set_index(['A','B','C'])
Out[79]:
       D
A B C
a x a  7
    b  4
  y b  5
    a  1
  z a  3
b x a  5
  y b  3
  z b  6
    a  1
Is that what you want?
In [64]: df.sort_values(['A','B','D'], ascending=[1,1,0]).set_index(['A','B','C'])
Out[64]:
       D
A B C
a z a  3
  x a  7
    b  4
  y b  5
    a  1
b z b  6
    a  1
  x a  5
  y b  3
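The sort_values approach depends on 'B' already being a categorical with the custom order. A self-contained sketch follows. It assumes pd.Categorical is an acceptable way to set that order, and it skips the pivot aggregation, which only works here because every (A, B, C) combination occurs exactly once in this data.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a", "a", "a", "a", "a", "b", "b", "b", "b"],
    "B": ["x", "y", "z", "x", "y", "z", "x", "y", "z"],
    "C": ["a", "b", "a", "b", "a", "b", "a", "b", "a"],
    "D": [7, 5, 3, 4, 1, 6, 5, 3, 1],
})

# Give B a custom order; sorting a categorical follows the category order
df["B"] = pd.Categorical(df["B"], categories=["z", "x", "y"], ordered=True)

# A ascending, B in the custom order, D descending within each (A, B) group
out = (
    df.sort_values(["A", "B", "D"], ascending=[True, True, False])
      .set_index(["A", "B", "C"])
)
print(out)
```

If the data had repeated (A, B, C) combinations, an aggregation step would be needed before the sort.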

converting 3 variable into a matrix form to create a heatmap in SAS

I'm trying to convert 3 variables into a matrix. For example, suppose you have the following:
(CHAR) (char) (num)
Var1 Var2 Var3
A B 1
C D 2
E F 3
A D 4
A F 5
C B 6
C F 7
E B 8
E D 9
Any ideas on how to convert the three variables above into the matrix form below? My goal is to construct a heatmap from this matrix.
B D F
A 1 4 5
C 6 2 7
E 8 9 3
Can anyone help me do this in SAS, either using SAS/IML or other Procedure? Thanks!
Assuming you are using a recent version of SAS/IML (13.1 or later), use the HEATMAPCONT or HEATMAPDISC call:
proc iml;
m = {1 4 5,
6 2 7,
8 9 3};
call heatmapcont(m) xvalues={B D F} yvalues={A C E};
For details, see Creating heat maps in SAS/IML
It would be better to post your code first and then ask questions.
I think PROC TRANSPOSE is the fastest solution.
data _t1;
input var1 $ var2 $ var3 5.;
cards;
A B 1
C D 2
E F 3
A D 4
A F 5
C B 6
C F 7
E B 8
E D 9
run;
proc sort data=_t1;by var1;run;
proc transpose data=_t1 out=_t2(drop=_name_ rename=(var1=HereUpToYou));
by var1;
var var3;
id var2;
run;
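The reshaping that PROC TRANSPOSE performs here (long triples to a wide matrix) can be sketched in plain Python for readers unfamiliar with SAS. This is an illustration of the idea only, not a replacement for the SAS code. The variable names are hypothetical, and rows and columns are simply sorted alphabetically.

```python
# (Var1, Var2, Var3) triples from the question
triples = [
    ("A", "B", 1), ("C", "D", 2), ("E", "F", 3),
    ("A", "D", 4), ("A", "F", 5), ("C", "B", 6),
    ("C", "F", 7), ("E", "B", 8), ("E", "D", 9),
]

row_labels = sorted({r for r, _, _ in triples})  # Var1 values become rows
col_labels = sorted({c for _, c, _ in triples})  # Var2 values become columns
lookup = {(r, c): v for r, c, v in triples}

# One list per row; a missing (row, column) pair would show up as None
matrix = [[lookup.get((r, c)) for c in col_labels] for r in row_labels]

print(" ", *col_labels)
for label, values in zip(row_labels, matrix):
    print(label, *values)
```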

How to sort dataframe in R with specified column order preservation?

Let's say I have a data.frame
x <- data.frame(a = c('A','A','A','A','A', 'C','C','C','C', 'B','B','B'),
b = c('a','c','a','a','c', 'd', 'e','e','d', 'b','b','b'),
c = c( 7, 3, 2, 4, 5, 3, 1, 1, 5, 5, 2, 3),
stringsAsFactors = FALSE)
> x
a b c
1 A a 7
2 A c 3
3 A a 2
4 A a 4
5 A c 5
6 C d 3
7 C e 1
8 C e 1
9 C d 5
10 B b 5
11 B b 2
12 B b 3
I would like to sort x by columns b and c while keeping the order of a as before. x[order(x$b, x$c),] breaks the order of column a. This is what I want:
a b c
3 A a 2
4 A a 4
1 A a 7
2 A c 3
5 A c 5
6 C d 3
9 C d 5
7 C e 1
8 C e 1
11 B b 2
12 B b 3
10 B b 5
Is there a quick way of doing it?
Currently I run a "for" loop and sort each subset; I'm sure there must be a better way.
Thank you!
Ilya
If column "a" is already in sorted order (as in the output below, where the groups run A, B, C), then it's this simple:
> x[order(x$a,x$b, x$c),]
a b c
3 A a 2
4 A a 4
1 A a 7
2 A c 3
5 A c 5
6 B d 3
9 B d 5
7 B e 1
8 B e 1
11 C b 2
12 C b 3
10 C b 5
If column a isn't sorted (but is grouped), convert it to a factor whose levels follow the order in which the values first appear in x$a, and sort on that.
Thank you Spacedman! Your recommendation works well.
x$a <- factor(x$a, levels = unique(x$a), ordered = TRUE)
x[order(x$a,x$b, x$c),]
Following Gavin's comment (the factor does not need to be ordered):
x$a <- factor(x$a, levels = unique(x$a))
x[order(x$a,x$b, x$c),]
require(doBy)
orderBy(~ a + b + c, data=x)
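The factor trick amounts to sorting on "rank of first appearance" for column a, followed by b and c. A minimal sketch of that idea in plain Python, using tuples in place of data frame rows; the names rows, first_seen and result are illustrative assumptions, not part of the R answers.

```python
# (a, b, c) rows from the question, in their original order
rows = [
    ("A", "a", 7), ("A", "c", 3), ("A", "a", 2), ("A", "a", 4), ("A", "c", 5),
    ("C", "d", 3), ("C", "e", 1), ("C", "e", 1), ("C", "d", 5),
    ("B", "b", 5), ("B", "b", 2), ("B", "b", 3),
]

# Rank each value of a by where it first appears: A -> 0, C -> 1, B -> 2.
# This plays the role of factor(x$a, levels = unique(x$a)).
first_seen = {}
for a, _, _ in rows:
    first_seen.setdefault(a, len(first_seen))

# Sort by that rank first, then by b and c
result = sorted(rows, key=lambda r: (first_seen[r[0]], r[1], r[2]))
for r in result:
    print(*r)
```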
