Selecting rows in DataFrame by/after grouping using DataFramesMeta in Julia

Selecting rows in DataFrame by/after grouping using DataFramesMeta in Julia - linq

I am trying to select certain data rows in a DataFrame using #linq macros:
using DataFrames, DataFramesMeta
df=DataFrame(x = ["a", "a", "a", "b", "b", "b"],
y = [1, 2, 3, 2, 3, 4],
z = [100, 200, 300, 456, 345, 234])
| Row | x | y | z |
|-----|-----|---|-----|
| 1 | "a" | 1 | 100 |
| 2 | "a" | 2 | 200 |
| 3 | "a" | 3 | 300 |
| 4 | "b" | 2 | 456 |
| 5 | "b" | 3 | 345 |
| 6 | "b" | 4 | 234 |
I am trying to select those rows that have the maximum y for a given type of x, that is
| Row | x | y | z |
|-----|-----|---|-----|
| 1 | "a" | 3 | 300 |
| 2 | "b" | 4 | 234 |
So, I am grouping by column x and adding a column with the maxima
#linq df |> #by(:x, maxY = maximum(:y))
which gives
| Row | x | maxY |
|-----|-----|------|
| 1 | "a" | 3 |
| 2 | "b" | 4 |
but I don't see how to put the corresponding z entries back in. Probably, it would be join but I don't see how to do that or get the result in another, simple way.

You can do it in one line joining on=[:x,:y] but for this to work you need to name the maximum(:y) column y not maxY:
df2 = #linq df |> by(:x, y=maximum(:y)) |> join(df, on=[:x, :y])
You can later rename that column to the intended maxY:
rename!(df2, :y, :maxY)

Related

Multiply Matrices in DAX

Suppose I have two matrices MatrixA and MatrixB given as follows (where i is the row number and j is the column number:
MatrixA | MatrixB
i | j | val | i | j | val
---|---|---- | ---|---|----
1 | 1 | 3 | 1 | 1 | 2
1 | 2 | 5 | 1 | 2 | 3
1 | 3 | 9 | 2 | 1 | 7
2 | 1 | 2 | 2 | 2 | -1
2 | 2 | 1 | 3 | 1 | 0
2 | 3 | 3 | 3 | 2 | -4
3 | 1 | 3 |
3 | 2 | -1 |
3 | 3 | 2 |
4 | 1 | 0 |
4 | 2 | 7 |
4 | 3 | 6 |
In a more familiar form, they look like this:
MatrixA = 3 5 9 MatrixB = 2 3
2 1 3 7 -1
-1 2 0 0 -4
7 0 6
I'd like to calculate their product (which is demonstrated in this YouTube video):
Product = 41 -32
11 -7
12 -5
14 -3
In the unpivoted column form I used earlier, this is
i | j | val
---|---|----
1 | 1 | 41
1 | 2 | -32
2 | 1 | 11
2 | 2 | -7
3 | 1 | 12
3 | 2 | -5
4 | 1 | 12
4 | 2 | -3
I'm looking for a general calculation that multiplies any compatible k x n and n x m matrices together as a calculated table.

I think I've got it figured out. If MatrixA is k x n and MatrixB is n x m dimensional:
Product =
ADDCOLUMNS(
CROSSJOIN(VALUES(MatrixA[i]), VALUES(MatrixB[j])),
"val",
SUMX(
ADDCOLUMNS(
SELECTCOLUMNS(GENERATESERIES(1, DISTINCTCOUNT(MatrixA[j])), "Index", [Value]),
"A", LOOKUPVALUE(MatrixA[val], MatrixA[i], [i], MatrixA[j], [Index]),
"B", LOOKUPVALUE(MatrixB[val], MatrixB[i], [Index], MatrixB[j], [j])),
[A] * [B]))
The CROSSJOIN creates a new table with columns [i] and [j] which has k x m rows. For each i and j row pair in this cross join table, the value for that cell is computed as the sum product of i row of MatrixA with j column of MatrixB. The GENERATESERIES bit just creates an Index list that has a length of the matching dimension n.
For example, when i = 3 and j = 2, the middle section for the given example is
ADDCOLUMNS(
SELECTCOLUMNS(GENERATESERIES(1, DISTINCTCOUNT(MatrixA[j])), "Index", [Value]),
"A", LOOKUPVALUE(MatrixA[val], MatrixA[i], 3, MatrixA[j], [Index]),
"B", LOOKUPVALUE(MatrixB[val], MatrixB[i], [Index], MatrixB[j], 2))
which generates the table
Index | A | B
------|-----|----
1 | -1 | 3
2 | 2 | -1
3 | 0 | -4
where the [A] column is the 3rd row of MatrixA and the [B] column is the 2nd row of MatrixB.

SQL difference between different columns from different rows

Lets say i have a table as follows:
| id | dir | p1 | p2 |
|----------------------|
| a | x | 1.2 | 1.3 |
| a | x | 1.2 | 1.3 |
| a | z | 2.1 | 3 |
| a | z | 2.1 | 3 |
| b | x | 1 | null|
| b | z | 4 | null|
I would like to have unique rows of row a and b where dir = x and dir = z. So two rows each.
Then when dir = z. Take the value in p1 - (p2 of the previous row for that id) as newval1 and the value in p2 - (p1 of the previous row for that id) as new val2.
Treating nulls as zeroes.
In steps I suppose it will be:
| id | dir | p1 | p2 |
|----------------------|
| a | x | 1.2 | 1.3 |
| a | z | 2.1 | 3 |
| b | x | 1 | null|
| b | z | 4 | null|
Desired result will be:
| id | newval1 | newval2 |
|--------------------------------|
| a | 0.8(2.1-1.3) | 1.8(3-1.2 |
| b | 4 (4-0) | -1(0-1) |
Is it possible to do this in SQL?

select id,
nvl(max(case when dir = 'z' then p1 end), 0)
- nvl(max(case when dir = 'x' then p2 end), 0) as newval1,
nvl(max(case when dir = 'z' then p2 end), 0)
- nvl(max(case when dir = 'x' then p1 end), 0) as newval2
from tbl
where dir in ('x', 'z')
group by id
;
ID NEWVAL1 NEWVAL2
-- ---------- ----------
a .8 1.8
b 4 -1
Or, if you are on version 11.1 or higher, you can use the pivot operator:
select id, z_p1 - x_p2 as newval1, z_p2 - x_p1 as newval2
from tbl
pivot ( max(nvl(p1, 0)) as p1, max(nvl(p2, 0)) as p2
for dir in ('x' as x, 'z' as z)
)
;

Is it possible to do a 'normalized' dense_rank() in hive?

I have a consumer table like so.
consumer | product | quantity
-------- | ------- | --------
a | x | 3
a | y | 4
a | z | 1
b | x | 3
b | y | 5
c | x | 4
What I want is a 'normalized' rank assigned to each consumer so that I can split the table easily for testing and training. I used the dense_rank() in hive, so I got the below table.
rank | consumer | product | quantity
---- | -------- | ------- | --------
1 | a | x | 3
1 | a | y | 4
1 | a | z | 1
2 | b | x | 3
2 | b | y | 5
3 | c | x | 4
This is well and good, but I want to scale this to use with any number of consumers, so I would ideally like the range of ranks between 0 and 1, like so.
rank | consumer | product | quantity
---- | -------- | ------- | --------
0.33 | a | x | 3
0.33 | a | y | 4
0.33 | a | z | 1
0.67 | b | x | 3
0.67 | b | y | 5
1 | c | x | 4
This way, I'd always know what the range of ranks is, and can split the data in a standard way (rank <= 0.7 training, and rank > 0.7 testing)
Is there a way to achieve this in hive?
Or, is there a different and better approach to my original issue of splitting the data?
I tried to do a select * where rank < 0.7*max(rank), but hive says the MAX UDAF is not yet available in where clause.

percent_rank
select percent_rank() over (order by consumer) as pr
,*
from mytable
;
+-----+----------+---------+----------+
| pr | consumer | product | quantity |
+-----+----------+---------+----------+
| 0.0 | a | z | 1 |
| 0.0 | a | y | 4 |
| 0.0 | a | x | 3 |
| 0.6 | b | y | 5 |
| 0.6 | b | x | 3 |
| 1.0 | c | x | 4 |
+-----+----------+---------+----------+
For filtering you'll need a sub-query / CTE
select *
from (select percent_rank() over (order by consumer) as pr
,*
from mytable
) t
where pr <= ...
;

Enumerating Cartesian product while minimizing repetition

Given two sets, e.g.:
{A B C}, {1 2 3 4 5 6}
I want to generate the Cartesian product in an order that puts as much space as possible between equal elements. For example, [A1, A2, A3, A4, A5, A6, B1…] is no good because all the As are next to each other. An acceptable solution would be going "down the diagonals" and then every time it wraps offsetting by one, e.g.:
[A1, B2, C3, A4, B5, C6, A2, B3, C4, A5, B6, C1, A3…]
Expressed visually:
| | A | B | C | A | B | C | A | B | C | A | B | C | A | B | C | A | B | C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | | | | | | | | | | | | | | | | | |
| 2 | | 2 | | | | | | | | | | | | | | | | |
| 3 | | | 3 | | | | | | | | | | | | | | | |
| 4 | | | | 4 | | | | | | | | | | | | | | |
| 5 | | | | | 5 | | | | | | | | | | | | | |
| 6 | | | | | | 6 | | | | | | | | | | | | |
| 1 | | | | | | | | | | | | | | | | | | |
| 2 | | | | | | | 7 | | | | | | | | | | | |
| 3 | | | | | | | | 8 | | | | | | | | | | |
| 4 | | | | | | | | | 9 | | | | | | | | | |
| 5 | | | | | | | | | | 10| | | | | | | | |
| 6 | | | | | | | | | | | 11| | | | | | | |
| 1 | | | | | | | | | | | | 12| | | | | | |
| 2 | | | | | | | | | | | | | | | | | | |
| 3 | | | | | | | | | | | | | 13| | | | | |
| 4 | | | | | | | | | | | | | | 14| | | | |
| 5 | | | | | | | | | | | | | | | 15| | | |
| 6 | | | | | | | | | | | | | | | | 16| | |
| 1 | | | | | | | | | | | | | | | | | 17| |
| 2 | | | | | | | | | | | | | | | | | | 18|
or, equivalently but without repeating the rows/columns:
| | A | B | C |
|---|----|----|----|
| 1 | 1 | 17 | 15 |
| 2 | 4 | 2 | 18 |
| 3 | 7 | 5 | 3 |
| 4 | 10 | 8 | 6 |
| 5 | 13 | 11 | 9 |
| 6 | 16 | 14 | 12 |
I imagine there are other solutions too, but that's the one I found easiest to think about. But I've been banging my head against the wall trying to figure out how to express it generically—it's a convenient thing that the cardinality of the two sets are multiples of each other, but I want the algorithm to do The Right Thing for sets of, say, size 5 and 7. Or size 12 and 69 (that's a real example!).
Are there any established algorithms for this? I keep getting distracted thinking of how rational numbers are mapped onto the set of natural numbers (to prove that they're countable), but the path it takes through ℕ×ℕ doesn't work for this case.
It so happens the application is being written in Ruby, but I don't care about the language. Pseudocode, Ruby, Python, Java, Clojure, Javascript, CL, a paragraph in English—choose your favorite.
Proof-of-concept solution in Python (soon to be ported to Ruby and hooked up with Rails):
import sys
letters = sys.argv[1]
MAX_NUM = 6
letter_pos = 0
for i in xrange(MAX_NUM):
for j in xrange(len(letters)):
num = ((i + j) % MAX_NUM) + 1
symbol = letters[letter_pos % len(letters)]
print "[%s %s]"%(symbol, num)
letter_pos += 1

String letters = "ABC";
int MAX_NUM = 6;
int letterPos = 0;
for (int i=0; i < MAX_NUM; ++i) {
for (int j=0; j < MAX_NUM; ++j) {
int num = ((i + j) % MAX_NUM) + 1;
char symbol = letters.charAt(letterPos % letters.length);
String output = symbol + "" + num;
++letterPos;
}
}

What about using something fractal/recursive? This implementation divides a rectangular range into four quadrants then yields points from each quadrant. This means that neighboring points in the sequence differ at least by quadrant.
#python3
import sys
import itertools
def interleave(*iters):
for elements in itertools.zip_longest(*iters):
for element in elements:
if element != None:
yield element
def scramblerange(begin, end):
width = end - begin
if width == 1:
yield begin
else:
first = scramblerange(begin, int(begin + width/2))
second = scramblerange(int(begin + width/2), end)
yield from interleave(first, second)
def scramblerectrange(top=0, left=0, bottom=1, right=1, width=None, height=None):
if width != None and height != None:
yield from scramblerectrange(bottom=height, right=width)
raise StopIteration
if right - left == 1:
if bottom - top == 1:
yield (left, top)
else:
for y in scramblerange(top, bottom):
yield (left, y)
else:
if bottom - top == 1:
for x in scramblerange(left, right):
yield (x, top)
else:
halfx = int(left + (right - left)/2)
halfy = int(top + (bottom - top)/2)
quadrants = [
scramblerectrange(top=top, left=left, bottom=halfy, right=halfx),
reversed(list(scramblerectrange(top=top, left=halfx, bottom=halfy, right=right))),
scramblerectrange(top=halfy, left=left, bottom=bottom, right=halfx),
reversed(list(scramblerectrange(top=halfy, left=halfx, bottom=bottom, right=right)))
]
yield from interleave(*quadrants)
if __name__ == '__main__':
letters = 'abcdefghijklmnopqrstuvwxyz'
output = []
indices = dict()
for i, pt in enumerate(scramblerectrange(width=11, height=5)):
indices[pt] = i
x, y = pt
output.append(letters[x] + str(y))
table = [[indices[x,y] for x in range(11)] for y in range(5)]
print(', '.join(output))
print()
pad = lambda i: ' ' * (2 - len(str(i))) + str(i)
header = ' |' + ' '.join(map(pad, letters[:11]))
print(header)
print('-' * len(header))
for y, row in enumerate(table):
print(pad(y)+'|', ' '.join(map(pad, row)))
Outputs:
a0, i1, a2, i3, e0, h1, e2, g4, a1, i0, a3, k3, e1,
h0, d4, g3, b0, j1, b2, i4, d0, g1, d2, h4, b1, j0,
b3, k4, d1, g0, d3, f4, c0, k1, c2, i2, c1, f1, a4,
h2, k0, e4, j3, f0, b4, h3, c4, j2, e3, g2, c3, j4,
f3, k2, f2
| a b c d e f g h i j k
-----------------------------------
0| 0 16 32 20 4 43 29 13 9 25 40
1| 8 24 36 28 12 37 21 5 1 17 33
2| 2 18 34 22 6 54 49 39 35 47 53
3| 10 26 50 30 48 52 15 45 3 42 11
4| 38 44 46 14 41 31 7 23 19 51 27

If your sets X and Y are sizes m and n, and Xi is the index of the element from X that's in the ith pair in your Cartesian product (and similar for Y), then
Xi = i mod n;
Yi = (i mod n + i div n) mod m;
You could get your diagonals a little more spread out by filling out your matrix like this:
for (int i = 0; i < m*n; i++) {
int xi = i % n;
int yi = i % m;
while (matrix[yi][xi] != 0) {
yi = (yi+1) % m;
}
matrix[yi][xi] = i+1;
}

how to find maximum of a group of rows in a certain column which satisfies a condition in tibco spotfire

I want to find maximum of a group of rows in a certain column which satisfies a condition in TIBCO Spotfire. For example, consider the table below:
col 1|col 2|col 3
1 | 2 | y
1 | 3 | y
1 | 6 | y
1 | 8 | n
1 | 7 | n
1 | 6 | y
2 | 2 | y
2 | 10 | y
2 | 6 | y
2 | 9 | n
2 | 7 | y
2 | 6 | n
I want to group all the rows with [col 1] = 1, and find the max of col 2 considering only those rows that have [col 3] = "y".
My final table must look like:
col 1|col 2|col 3|col 4
1 | 2 | y | 6
1 | 3 | y | 6
1 | 6 | y | 6
1 | 8 | n | 6
1 | 7 | n | 6
1 | 6 | y | 6
2 | 2 | y | 10
2 | 10 | y | 10
2 | 6 | y | 10
2 | 9 | n | 10
2 | 7 | y | 10
2 | 6 | n | 10
Can some one please help me out with this?

First(case when [col 3]="y" then Max([col 2]) OVER ([col 1]) end) OVER ([col 1]) should do the trick (version 7.5).
Thanks!

I came up with something that sounds like what you already tried, but here goes.
Insert Calculated Column: CASE WHEN [col 3]="y" THEN Max([col 2]) OVER ([col 1]) END AS [calc]
Insert Calculated Column: Max([calc]) OVER ([col 1]) AS [col 4]
Those give me the value in [col 4] that you were looking for.

#monte_fisto in the similar case can we identify the min and max of a col2

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Selecting rows in DataFrame by/after grouping using DataFramesMeta in Julia - linq

You can do it in one line joining on=[:x,:y] but for this to work you need to name the maximum(:y) column y not maxY: df2 = #linq df |> by(:x, y=maximum(:y)) |> join(df, on=[:x, :y]) You can later rename that column to the intended maxY: rename!(df2, :y, :maxY)

Related

Multiply Matrices in DAX

SQL difference between different columns from different rows

Is it possible to do a 'normalized' dense_rank() in hive?

Enumerating Cartesian product while minimizing repetition

how to find maximum of a group of rows in a certain column which satisfies a condition in tibco spotfire

Categories

Resources