Select only first rows in each h2o dataframe group_by group (for merging)? - h2o

Is there a way to select only first rows in each h2o dataframe group_by group?
The reason for doing this is to merge some columns from an h2o dataframe into a group_by'ed version of that dataframe, which was created to get some stats based on particular groupings in the original.
For example, suppose I had two dataframes like
df1
receipt_key b c item_id
------------------------
a1 1 2 1
a2 3 4 1
and
df2
receipt_key e f item_id
--------------------------
a1 5 6 1
a1 7 8 2
a2 9 10 1
and would like to join them so that I end up with the dataframe
df3
receipt_key b c e f item_id
-----------------------------
a1 1 2 5 6 1
a2 3 4 9 10 1
I have tried something like df2.group_by('receipt_key').max('item_id') to merge into df1, but that leaves only the item_id column in the group's get_frame() dataframe. Even listing all of the columns in df2 for max() would not give the right values, and it would be cumbersome for my actual use case, which has many more columns in df2.
Any ideas on how this could be done? Would simply deleting duplicates be sufficient to get the desired dataframe (though there appear to be barriers to doing this in h2o, see https://0xdata.atlassian.net/browse/PUBDEV-3292)?

here you go:
import h2o
h2o.init()
df1 = h2o.H2OFrame({'receipt_key': ['a1', 'a2'], 'b': [1, 3], 'c': [2, 4], 'item_id': [1, 1]})
df1['receipt_key'] = df1['receipt_key'].asfactor()
df2 = h2o.H2OFrame({'receipt_key': ['a1', 'a1', 'a2'], 'e': [5, 7, 9], 'f': [6, 8, 10], 'item_id': [1, 2, 1]})
df2['receipt_key'] = df2['receipt_key'].asfactor()
df3 = df1.merge(df2)
df_subset = df3[['receipt_key','b','c','e','f','item_id']]
print(df_subset)
receipt_key b c e f item_id
a1 1 2 5 6 1
a2 3 4 9 10 1
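The merge above appears to work because the two frames share both receipt_key and item_id, so the a1 row with item_id 2 has no partner in df1 and drops out. As a rough illustration of that behaviour (not h2o itself), here is a minimal plain-Python sketch; the list-of-dicts layout and the inner_join helper are assumptions made for this example only.

```python
# Hypothetical plain-Python stand-ins for the two frames in the question.
df1 = [
    {"receipt_key": "a1", "b": 1, "c": 2, "item_id": 1},
    {"receipt_key": "a2", "b": 3, "c": 4, "item_id": 1},
]
df2 = [
    {"receipt_key": "a1", "e": 5, "f": 6, "item_id": 1},
    {"receipt_key": "a1", "e": 7, "f": 8, "item_id": 2},
    {"receipt_key": "a2", "e": 9, "f": 10, "item_id": 1},
]


def inner_join(left, right):
    """Inner-join two lists of dicts on every column name they share."""
    shared = sorted(set(left[0]) & set(right[0]))
    # Index the right-hand rows by their values in the shared columns.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in shared), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in shared), []):
            joined.append({**row, **match})
    return joined


# Joins on receipt_key and item_id; the (a1, item_id 2) row has no match.
df3 = inner_join(df1, df2)
for row in df3:
    print(row)
```

If the frames shared only receipt_key, a join like this would return both a1 rows, so the result depends on item_id being a common column.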

Related

Sorting/ordering values from smallest to biggest in an array

I have a formula like this : =ArrayFormula(sort(INDEX($B$1:$B$10,MATCH(E1,$A$1:$A$10,0))))
in columns A:B:
a 1
b 2
c 3
d 4
e 5
f 6
g 7
h 8
i 9
j 10
and
the data to convert in E:H
a c f e
f a c b
b a c d
I get the following results using the above formula
in columns L:O:
1 3 6 5
6 1 3 2
2 1 3 4
My desired output is like this:
1 3 5 6
1 2 3 6
1 2 3 4
I'd like to arrange the numbers in each row from smallest to biggest. I can do this with additional helper cells, but if possible I'd like to get the same result without any additional cells. Can I get a little help please? Thanks.
To sort by row, use SORT inside BYROW. Unfortunately, BYROW does not support nested array results, so we need to JOIN each sorted row and then SPLIT the resulting array.
=ARRAYFORMULA(SPLIT(BYROW(your_formula,LAMBDA(row,JOIN("🌆",SORT(TRANSPOSE(row))))),"🌆"))
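For readers more comfortable outside the spreadsheet, the same idea (look up each key, then sort within each row) can be sketched in Python. This is only an illustration of the logic, not a replacement for the formula; the variable names are made up for the example and the data is copied from the question.

```python
# Columns A:B from the question: letters a..j map to 1..10.
lookup = dict(zip("abcdefghij", range(1, 11)))

# Range E1:H3 from the question, one list per row.
rows = [list("acfe"), list("facb"), list("bacd")]

# Look up every key, then sort the values within each row independently.
sorted_rows = [sorted(lookup[key] for key in row) for row in rows]

for row in sorted_rows:
    print(row)
# [1, 3, 5, 6]
# [1, 2, 3, 6]
# [1, 2, 3, 4]
```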
Here's another way using Makearray with Index to get the current row and Small to get the smallest, next smallest etc. within the row:
=ArrayFormula(makearray(3,4,lambda(r,c,small(index(vlookup(E1:H3,A1:B10,2,false),r,0),c))))
Or you could change the order (might be a little faster) as you don't need to vlookup the entire array, just the current row:
=ArrayFormula(makearray(3,4,lambda(r,c,small(vlookup(index(E1:H3,r,0),A1:B10,2,false),c))))
It's interesting (to me at any rate) that you can interrogate the row and column number of the current cell using Map or Scan, so this is also possible:
=ArrayFormula(map(E1:H3,lambda(cell,small(vlookup(index(E1:H3,row(cell),0),A1:B10,2,false),column(cell)-column(E:E)+1))))
Thanks to @JvdV for this insight (which may be obvious to some but wasn't to me) shown here in Excel.
try:
=INDEX(TRIM(SPLIT(FLATTEN(QUERY(QUERY(QUERY(SPLIT(FLATTEN(E1:H3&"×​"&ROW(E1:H3)), "​"),
"select max(Col1) group by Col1 pivot Col2"), "offset 1", 0),,9^9)), "×")))
or if you want numbers:
=INDEX(IFNA(VLOOKUP(TRIM(SPLIT(FLATTEN(QUERY(QUERY(QUERY(SPLIT(FLATTEN(E1:H3&"×​"&ROW(E1:H3)), "​"),
"select max(Col1) group by Col1 pivot Col2"), "offset 1", 0),,9^9)), "×")), A:B, 2, 0)))

Multiple sorting conditions in DolphinDB

Suppose I have a table as follows:
id=`A`B`A`B`B`B`A
item= 10 1 1 3 5 10 6
t=table(id,item)
id item
-- ----
A 10
B 1
A 1
B 3
B 5
B 10
A 6
For example, I want to sort the table with two conditions: first, by the most commonly occurring item in column item, then by the highest number in column item.
How can I sort like this:
id item
--- ----
A 10
B 10
A 1
B 1
A 6
B 5
B 3
Is there any way to go about this? Thanks!
Try this:
t1=table(id,item);
update t1 set count=count(item) context by item;
select * from t1 order by count desc, item desc;
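The logic of the query (attach a per-item count, then order by count and item, both descending) can be sketched in plain Python as a cross-check. This is not DolphinDB code. The third sort key, id ascending, is an assumption added here so that ties come out in the order shown in the question; the query above does not specify a tie-break.

```python
from collections import Counter

ids = ["A", "B", "A", "B", "B", "B", "A"]
items = [10, 1, 1, 3, 5, 10, 6]

# Equivalent of "count(item) context by item": occurrences of each item value.
counts = Counter(items)

# Order by count descending, then item descending.
# The final key (id ascending) is an assumed tie-break, not part of the query.
rows = sorted(zip(ids, items), key=lambda r: (-counts[r[1]], -r[1], r[0]))

for row in rows:
    print(*row)
```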

Pandas multiindex sort

In Pandas 0.19 I have a large dataframe with a Multiindex of the following form
C0 C1 C2
A B
bar one 4 2 4
two 1 3 2
foo one 9 7 1
two 2 1 3
I want to reorder the columns within each group such as bar and foo (there are many more such pairs of rows) according to the values in its "two" row, to get the following:
C0 C1 C2
A B
bar one 4 4 2
two 1 2 3
foo one 7 9 1
two 1 2 3
I am interested in speed (as I have many columns and many pairs of rows). I am also happy with re-arranging the data if it speeds up the sorting. Many thanks
Here is a mostly numpy solution that should yield good performance. It first selects only the 'two' rows and argsorts them. It then sets this order for each row of the original dataframe. It then unravels this order (after adding a constant to offset each row) and the original dataframe values. It then reorders all the original values based on this unraveled, offset and argsorted array before creating a new dataframe with the intended sort order.
rows, cols = df.shape
df_a = np.argsort(df.xs('two', level=1))
order = df_a.reindex(df.index.droplevel(-1)).values
offset = np.arange(len(df)) * cols
order_final = order + offset[:, np.newaxis]
pd.DataFrame(df.values.ravel()[order_final.ravel()].reshape(rows, cols), index=df.index, columns=df.columns)
Output
C0 C1 C2
A B
bar one 4 4 2
two 1 2 3
foo one 7 9 1
two 1 2 3
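To make the offset trick easier to follow, here is a small pure-Python sketch of the same steps on the four example rows: argsort each group's 'two' row, apply that order to every row of the group, shift each row's indices by row_number * cols, and index into the flattened values. It uses plain lists instead of numpy and pandas, so it shows the idea only and says nothing about performance; all names are made up for the example.

```python
# (A, B, values) triples standing in for the MultiIndex frame in the question.
rows_data = [
    ("bar", "one", [4, 2, 4]),
    ("bar", "two", [1, 3, 2]),
    ("foo", "one", [9, 7, 1]),
    ("foo", "two", [2, 1, 3]),
]
cols = 3

# Argsort of each group's 'two' row: column positions ordered by value.
order_by_group = {
    a: sorted(range(cols), key=values.__getitem__)
    for a, b, values in rows_data
    if b == "two"
}

# Flatten all values row by row, like df.values.ravel().
flat = [v for _, _, values in rows_data for v in values]

# For every row, reuse its group's order and offset it by row_number * cols
# so the positions point into the flattened list.
flat_order = [
    r * cols + c
    for r, (a, _, _) in enumerate(rows_data)
    for c in order_by_group[a]
]

# Reorder the flat values and cut them back into rows.
reordered = [flat[i] for i in flat_order]
result = [reordered[r * cols:(r + 1) * cols] for r in range(len(rows_data))]

for (a, b, _), values in zip(rows_data, result):
    print(a, b, values)
```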
Some Speed tests
# create much larger frame
import string
idx = pd.MultiIndex.from_product((list(string.ascii_letters), list(string.ascii_letters) + ['two']))
df1 = pd.DataFrame(index=idx, data=np.random.rand(len(idx), 3), columns=['C0', 'C1', 'C2'])
#scott boston
%timeit df1.groupby(level=0).apply(sortit)
10 loops, best of 3: 199 ms per loop
#Ted
1000 loops, best of 3: 5 ms per loop
Here is a solution, albeit kludgy:
Input dataframe:
C0 C1 C2
A B
bar one 4 2 4
two 1 3 2
foo one 9 7 1
two 2 1 3
Custom sorting function:
def sortit(x):
    xcolumns = x.columns.values
    x.index = x.index.droplevel()
    x.sort_values(by='two',axis=1,inplace=True)
    x.columns = xcolumns
    return x
df.groupby(level=0).apply(sortit)
Output:
C0 C1 C2
A B
bar one 4 4 2
two 1 2 3
foo one 7 9 1
two 1 2 3

sorting a dataframe by values and storing index and columns

I have a pandas DataFrame which is actually a matrix. It looks as shown below
a b c
d 1 0 5
e 0 6 2
f 2 0 3
I need the values sorted in descending order, together with the index and column label of each value. The result should be
index Column Value
e b 6
d c 5
f c 3
You need stack for reshape with nlargest:
df1 = df.stack().nlargest(3).rename_axis(['idx','col']).reset_index(name='val')
print (df1)
idx col val
0 e b 6
1 d c 5
2 f c 3
For MultiIndex:
df2 = df.stack().nlargest(3).to_frame(name='val')
print (df2)
val
e b 6
d c 5
f c 3
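What stack followed by nlargest does can be sketched without pandas: flatten the matrix into (index, column, value) triples and keep the three largest by value. The nested-dict layout below is just an assumed stand-in for the DataFrame in the question.

```python
import heapq

# The matrix from the question as {index: {column: value}}.
matrix = {
    "d": {"a": 1, "b": 0, "c": 5},
    "e": {"a": 0, "b": 6, "c": 2},
    "f": {"a": 2, "b": 0, "c": 3},
}

# "Stack" the matrix: one (index, column, value) triple per cell.
stacked = [
    (idx, col, val)
    for idx, row in matrix.items()
    for col, val in row.items()
]

# Keep the three largest values along with their labels.
top3 = heapq.nlargest(3, stacked, key=lambda t: t[2])

for idx, col, val in top3:
    print(idx, col, val)
# e b 6
# d c 5
# f c 3
```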

Natural Join of different tables

Could you please explain to me how to do a NATURAL JOIN on these two relations (one having 5 rows and the other 3 rows)?
1st relation
A C
3 3
6 4
2 3
3 5
7 1
2nd relation
B C D
5 1 6
1 5 8
4 3 9
In your question you have two separate relations, which have one attribute (i.e. column) in common: C.
A natural join will combine all tuples in both relations with that attribute in common. You will end up with the results:
A B C D
7 5 1 6
3 4 3 9
2 4 3 9
3 1 5 8
This can be performed in SQL by using the code @Matthew posted.
Something like:
SELECT * FROM 1stRelation NATURAL JOIN 2ndRelation
It does the same thing as an inner join using explicit column names, i.e.:
SELECT * from 1stRelation as x INNER JOIN 2ndRelation as z ON x.C=z.C
Personally, I prefer not to use natural joins, except possibly when I do not know the table structure in advance but know the tables should be joinable.
Basically, you do a CROSS JOIN, i.e. you combine every row from the 1st relation with every row of the 2nd relation. Then you have two C columns. Now you eliminate every row where the two C values are not equal and merge the two into a single column C.
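The cross-join-then-filter description can be checked with a short Python sketch. It is only an illustration of the definition, with the two relations from the question written as lists of dicts; natural_join is a hypothetical helper name, and a real database would not evaluate the join this way.

```python
from itertools import product

rel1 = [
    {"A": 3, "C": 3},
    {"A": 6, "C": 4},
    {"A": 2, "C": 3},
    {"A": 3, "C": 5},
    {"A": 7, "C": 1},
]
rel2 = [
    {"B": 5, "C": 1, "D": 6},
    {"B": 1, "C": 5, "D": 8},
    {"B": 4, "C": 3, "D": 9},
]


def natural_join(left, right):
    """Cross join, keep pairs that agree on all shared attributes, merge them."""
    shared = set(left[0]) & set(right[0])
    joined = []
    for l_row, r_row in product(left, right):
        if all(l_row[k] == r_row[k] for k in shared):
            # Merging the dicts leaves a single copy of each shared attribute.
            joined.append({**l_row, **r_row})
    return joined


result = natural_join(rel1, rel2)
for row in result:
    print(row["A"], row["B"], row["C"], row["D"])
```

The row with C = 4 has no partner in the second relation, so 4 of the 15 cross-join pairs survive.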
