Pandas MultiIndex sort

In pandas 0.19 I have a large dataframe with a MultiIndex of the following form:
          C0  C1  C2
A   B
bar one    4   2   4
    two    1   3   2
foo one    9   7   1
    two    2   1   3
I want to sort bar and foo (and many more pairs of rows like them) according to "two", to get the following:
          C0  C1  C2
A   B
bar one    4   4   2
    two    1   2   3
foo one    7   9   1
    two    1   2   3
I am interested in speed (as I have many columns and many pairs of rows). I am also happy with re-arranging the data if it speeds up the sorting. Many thanks

Here is a mostly numpy solution that should yield good performance. It first selects only the 'two' rows and argsorts them, then broadcasts that order to every row of the original dataframe. It adds a per-row offset to this order, ravels both the order and the original values, reorders the flat values by the offset order, and finally builds a new dataframe with the intended sort order.
import numpy as np
import pandas as pd
rows, cols = df.shape
# argsort the 'two' row of each group to get that group's column order
df_a = np.argsort(df.xs('two', level=1))
# broadcast each group's order to both of its rows
order = df_a.reindex(df.index.droplevel(-1)).values
# offset each row's order so it indexes into the ravelled values
offset = np.arange(len(df)) * cols
order_final = order + offset[:, np.newaxis]
# reorder the flat values and rebuild the frame with the original index and columns
pd.DataFrame(df.values.ravel()[order_final.ravel()].reshape(rows, cols), index=df.index, columns=df.columns)
Output
          C0  C1  C2
A   B
bar one    4   4   2
    two    1   2   3
foo one    7   9   1
    two    1   2   3
Some Speed tests
# create much larger frame
import string
idx = pd.MultiIndex.from_product((list(string.ascii_letters), list(string.ascii_letters) + ['two']))
df1 = pd.DataFrame(index=idx, data=np.random.rand(len(idx), 3), columns=['C0', 'C1', 'C2'])
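For timing, the numpy approach above can be wrapped in a function; the name sort_by_two below is just a convenience for this benchmark, not something from the original answer:
def sort_by_two(df):
    # same steps as the numpy solution above, packaged so it can be timed on df1
    rows, cols = df.shape
    order = np.argsort(df.xs('two', level=1)).reindex(df.index.droplevel(-1)).values
    offset = np.arange(len(df))[:, np.newaxis] * cols
    flat = df.values.ravel()[(order + offset).ravel()]
    return pd.DataFrame(flat.reshape(rows, cols), index=df.index, columns=df.columns)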
#scott boston (the groupby/apply sortit solution shown in the next answer)
%timeit df1.groupby(level=0).apply(sortit)
10 loops, best of 3: 199 ms per loop
#Ted (the numpy approach above)
1000 loops, best of 3: 5 ms per loop

Here is a solution, albeit kludgy:
Input dataframe:
          C0  C1  C2
A   B
bar one    4   2   4
    two    1   3   2
foo one    9   7   1
    two    2   1   3
Custom sorting function:
def sortit(x):
    xcolumns = x.columns.values
    x.index = x.index.droplevel()
    x.sort_values(by='two', axis=1, inplace=True)
    x.columns = xcolumns
    return x
df.groupby(level=0).apply(sortit)
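sort_values(by='two', axis=1) reorders the columns of each group so that its 'two' row is ascending; reassigning the saved column names afterwards means only the values move, not the column labels.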
Output:
          C0  C1  C2
A   B
bar one    4   4   2
    two    1   2   3
foo one    7   9   1
    two    1   2   3

Related

Select only first rows in each h2o dataframe group_by group (for merging)?

Is there a way to select only the first rows in each h2o dataframe group_by group?
The reason for doing this is to merge some columns of an h2o dataframe into a group_by'ed version of that dataframe that was created to get some stats based on particular groupings in the original.
For example, suppose we had two dataframes like
df1
receipt_key  b  c  item_id
--------------------------
a1           1  2  1
a2           3  4  1
and
df2
receipt_key  e  f   item_id
---------------------------
a1           5  6   1
a1           7  8   2
a2           9  10  1
We would like to join them such that we end up with the dataframe
df3
receipt_key  b  c  e  f   item_id
---------------------------------
a1           1  2  5  6   1
a2           3  4  9  10  1
I have tried something like df2.group_by('receipt_key').max('item_id') to merge into df1, but doing so only leaves the item_id column in the group's get_frame() dataframe (and even listing all of the columns of df2 to max() on would not give the right values, as well as being cumbersome for my actual use case, which has many more columns in df2).
Any ideas on how this could be done? Would simply deleting duplicates be sufficient to get the desired dataframe (though there appear to be barriers to doing this in h2o, see https://0xdata.atlassian.net/browse/PUBDEV-3292)?
here you go:
import h2o
h2o.init()
df1 = h2o.H2OFrame({'receipt_key': ['a1', 'a2'] , 'b':[1,3] , 'c':[2,4], 'item_id': [1,1]})
df1['receipt_key'] = df1['receipt_key'].asfactor()
df2 = h2o.H2OFrame({'receipt_key': ['a1', 'a1','a2'] , 'e':[5,7,9] , 'f':[6,8,10], 'item_id': [1,2,1]})
df2['receipt_key'] = df2['receipt_key'].asfactor()
df3 = df1.merge(df2)
df_subset = df3[['receipt_key','b','c','e','f','item_id']]
print(df_subset)
receipt_key  b  c  e  f   item_id
a1           1  2  5  6   1
a2           3  4  9  10  1
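Note that merge joins on the columns the two frames have in common (receipt_key and item_id here), which is why only the item_id = 1 rows of df2 appear in the result.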

Assign random number by group

I am trying to assign the same random number to each observation within a group. Thus in the dataset below, the value of the variable "random" would be equal for each observation where gp=B, and would take another value for each observation where gp=A, and so on.
data test ;
input gp $ a b c ;
datalines;
B 2 2 3
B 2 2 3
A 1 2 3
A 1 2 3
C 3 3 4
C 3 3 4
;
Stupidly, I tried to create a different seed for each group based upon values common to all observations in the group:
data test2 ;
set test ;
seed = a*b*c ;
random = ranuni(seed) ;
run ;
This creates a common starting point per group, but the value obviously changes for each observation.
How can I obtain a random number equivalent for each observation in the group? Due to the very large size of the real dataset I would like to avoid any sorting or other time consuming processes.
The required dataset would thus look something like:
data want ;
input gp $ a b c random ;
datalines;
B 2 2 3 0.123
B 2 2 3 0.123
A 1 2 3 0.456
A 1 2 3 0.456
C 3 3 4 0.789
C 3 3 4 0.789
;
This should do the trick; ask me if you have any questions:
proc sort data=test;
  by gp;
run;

data test2;
  drop seed;
  set test;
  by gp;
  retain random;
  if first.gp then do;
    seed = a*b*c ;
    random = ranuni(seed) ;
  end;
run;
Basically, each time you call ranuni you get a new random number, so you only want to call it when the group (gp) changes; the retain statement then carries that value across the remaining observations of the group.

Improve the performance of the code with fewer operations

There are two vectors:
a = 1:5;
b = 1:2;
in order to find all combinations of these two vectors, I am using the following piece of code:
[A,B] = meshgrid(a,b);
C = cat(2,A',B');
D = reshape(C,[],2);
the result includes all the combinations:
D =
1 1
2 1
3 1
4 1
5 1
1 2
2 2
3 2
4 2
5 2
now the questions:
1- I want to decrease the number of operations to improve the performance for larger vectors. Is there a single function in MATLAB that does this?
2- When there are more than 2 vectors, the meshgrid function cannot be used and has to be replaced with for loops. What is a better solution?
For greater than 2 dimensions, use ndgrid:
>> a = 1:2; b = 1:3; c = 1:2;
>> [A,B,C] = ndgrid(a,b,c);
>> D = [A(:) B(:) C(:)]
D =
1 1 1
2 1 1
1 2 1
2 2 1
1 3 1
2 3 1
1 1 2
2 1 2
1 2 2
2 2 2
1 3 2
2 3 2
Note that ndgrid expects (rows,cols,...) rather than (x,y).
This can be generalized to N dimensions (see here and here):
params = {a,b,c};
vecs = cell(numel(params),1);
[vecs{:}] = ndgrid(params{:});
D = reshape(cat(numel(vecs)+1,vecs{:}),[],numel(vecs));
Also, as described in Robert P.'s answer and here too, kron can also be useful for replicating values (indexes) in this way.
If you have the neural network toolbox, also have a look at combvec, as demonstrated here.
One way would be to combine repmat and the Kronecker tensor product like this:
[repmat(a,size(b)); kron(b,ones(size(a)))]'
ans =
1 1
2 1
3 1
4 1
5 1
1 2
2 2
3 2
4 2
5 2
This can be scaled to more dimensions this way:
a = 1:3;
b = 1:3;
c = 1:3;
x = [repmat(a,1,numel(b)*numel(c)); ...
     repmat(kron(b,ones(1,numel(a))),1,numel(c)); ...
     kron(c,ones(1,numel(a)*numel(b)))]'
There is a logic to it: first, simply repeat the first vector. Second, expand each element of the second vector by the length of the first (the kron with ones) and repeat that. Third, expand each element of the third vector by the length of (first x second); in this case there is no fourth vector, so no repeat is needed.

R - Improving the performance of a simple loop

I'm a beginner with R, so I'm having trouble thinking of things the "R way"...
I have this function:
upOneRow <- function(table, column) {
  for (i in 1:(nrow(table) - 1)) {
    table[i, column] = table[i + 1, column]
  }
  return(table)
}
It seems simple enough, and shouldn't take that long to run, but on a dataframe with ~300k rows, the time it takes to run is unreasonable. What is the right way to approach this?
Instead of the loop you could try something like this:
n <- nrow(table)
table[(1:(n-1)), column] <- table[(2:n), column];
Vectorizing is the key.
Simple answer: Columns in a data.frame are also vectors which can be indexed with [,]
my.table <- data.frame(x = 1:5, y = 5:1)
> my.table
x y
1 1 5
2 2 4
3 3 3
4 4 2
5 5 1
my.table$y <-c(my.table[-1,"y"],NA) #move up one spot and pad with NA
> my.table
x y
1 1 4
2 2 3
3 3 2
4 4 1
5 5 NA
Now your function repeats the last data point at the end. If this is really what you want, pad with tail(x,1) instead of NA.
my.table$y <-c(my.table[-1,"y"],tail(my.table$y,1)) #pad with tail(x,1)
> my.table
x y
1 1 4
2 2 3
3 3 2
4 4 1
5 5 1
If I understand you right, you're trying to "move up" one column of a data frame, with the first element going to the bottom. Then it might be achieved as:
col <- table[, column]
table[, column] <- col[c(2:nrow(table), 1)]

What are some good ways to calculate a score for how different or close two users' choices are?

For example, if it is the choice of chocolate, ice cream, donut, ..., ranked in the order of their preference.
If user 1 chooses
A B C D E F G H I J
and user 2 chooses
J A B C I G F E D H
what are some good ways to calculate a score from 0 to 100 to tell how close their choices are? It has to make sense: if most answers are the same and only 1 or 2 differ, the score should not end up extremely low. Likewise, if most answers are just "shifted by 1 position", we cannot count them as "all different" and give a score of 0 for those differences of only 1 position.
Assign each letter item an integer value starting at 1
A=1, B=2, C=3, D=4, E=5, F=6 (stopping at F for simplicity)
Then consider the order in which the items are placed, and use this as a multiplier.
So if an item is in the first position, its multiplier is 1; if it's in the 6th position, the multiplier is 6.
Figure out the maximum score you could have (basically when everything is in consecutive order)
item    a   b   c   d   e   f
order   1   2   3   4   5   6
value   1   2   3   4   5   6
score   1   4   9  16  25  36    Sum = 91, Score = 100% (MAX)

item    a   b   d   c   e   f
order   1   2   3   4   5   6
value   1   2   4   3   5   6
score   1   4  12  12  25  36    Sum = 90, Score = 99%
=======================
order   1   2   3   4   5   6
item    f   d   b   c   e   a
value   6   4   2   3   5   1
score   6   8   6  12  25   6    Sum = 63, Score = 69%

order   1   2   3   4   5   6
item    d   f   b   c   e   a
value   4   6   2   3   5   1
score   4  12   6  12  25   6    Sum = 65, Score = 71%
Obviously this is a very crude scheme that I just came up with, and it may not work for everything. Examples 3 and 4 differ by a single swap of adjacent items, yet their scores differ by 2% (versus examples 1 and 2, which differ by only 1%). It's just a thought; I'm no algorithm expert. You could probably take the final number and do something else to it for a better numerical comparison.
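A small Python sketch of one reading of this scheme (the function name crude_similarity is ours, and taking each item's value from the first user's ranking is an assumption, since the answer above only scores an ordering against the canonical a-f order):
def crude_similarity(user1, user2):
    # value of each item = its 1-based position in user1's ranking
    value = {item: i + 1 for i, item in enumerate(user1)}
    # multiplier = the item's 1-based position in user2's ranking
    total = sum(value[item] * (i + 1) for i, item in enumerate(user2))
    # the maximum possible sum occurs when both rankings agree exactly
    max_total = sum((i + 1) ** 2 for i in range(len(user1)))
    return 100 * total / max_total

print(round(crude_similarity("abcdef", "fdbcea")))  # 69, matching example 3 above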
You could:
1. Calculate the edit distance between the sequences;
2. Subtract the edit distance from the sequence length;
3. Divide that by the length of the sequence;
4. Multiply it by a hundred.
Score = 100 * (SequenceLength - Levenshtein( Sequence1, Sequence2 ) ) / SequenceLength
Edit distance is basically the number of operations required to transform sequence one into sequence two. One such algorithm is the Levenshtein distance algorithm.
Examples:
Weights
insert: 1
delete: 1
substitute: 1
Seq 1: ABCDEFGHIJ
Seq 2: JABCIGFEDH
Score = 100 * (10-7) / 10 = 30
Seq 1: ABCDEFGHIJ
Seq 2: ABDCFGHIEJ
Score = 100 * (10-3) / 10 = 70
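A minimal Python sketch of this scoring, with the Levenshtein distance written out by hand (no particular library assumed):
def levenshtein(s1, s2):
    # classic dynamic-programming edit distance; insert, delete and substitute all cost 1
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete
                            curr[j - 1] + 1,      # insert
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

def similarity_score(seq1, seq2):
    return 100 * (len(seq1) - levenshtein(seq1, seq2)) / len(seq1)

print(similarity_score("ABCDEFGHIJ", "JABCIGFEDH"))  # 30.0, as in the first example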
The most straightforward way to calculate it is the Levenshtein distance, which is the number of changes that must be made to transform one string into another.
A disadvantage of the Levenshtein distance for your task is that it doesn't measure closeness between the products themselves, i.e. you will not know how close A and J are to each other. For example, user 1 may like donuts and user 2 may like buns, and you may know that most people who like the first also like the second. From that information you can infer that user 1 makes choices close to those of user 2, even though their lists don't share the same elements.
If this is your case, you will have to use one of two things: statistical methods to infer correlation between choices, or recommendation engines.
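If both users rank exactly the same set of items, a rank-correlation statistic is another common way to get a 0-100 score of this kind; below is a sketch using scipy.stats.spearmanr (scipy is assumed to be available, and rescaling from [-1, 1] to [0, 100] is our choice, not part of the answers above):
from scipy.stats import spearmanr

def rank_score(user1, user2):
    # rank of every item in each user's list; both lists must contain the same items
    pos2 = {item: i for i, item in enumerate(user2)}
    ranks1 = list(range(len(user1)))
    ranks2 = [pos2[item] for item in user1]
    rho, _ = spearmanr(ranks1, ranks2)   # rho is in [-1, 1]
    return 100 * (rho + 1) / 2           # rescale to 0..100

print(round(rank_score("ABCDEFGHIJ", "JABCIGFEDH")))  # about 58 for the example lists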

Resources