Linq query to select rows where a column is a max value - linq

I'd like to query a database table that looks like the simplified example below:
Quote | Sequence | Item
-------|-----------|-----
1 | 1.0M | a
1 | 2.0M | a
1 | 3.0M | a
1 | 1.0M | b
1 | 2.0M | b
1 | 3.0M | b
2 | 1.0M | x
2 | 2.0M | x
3 | 1.0M | y
and I need a query that gets all rows for a given Quote where the Sequence is the max value for that column:
Quote | Sequence | Item
-------|-----------|-----
1 | 3.0M | a
1 | 3.0M | b
2 | 2.0M | x
3 | 1.0M | y
I'm using F# and System.Data.Linq.
I can use
let quoteQuery =
query{
for row in db.[TABLE] do
select row
}
to get all rows, but I don't know Linq well enough--yet--to modify this to have the query that will produce the desired results. I've tried using the answer from this question in an attempt to modify my query, but I've hit a wall in trying to modify (guess?) the syntax/language necessary.
There are several SQL examples I can find, but few that are Linq-specific.

As hinted in the comments, this is not LINQ but F# query expressions.
And that is in fact not really what this question is about after all.
It's more about sets and relational algebra. Or something...
That said, the point here is that if you group by (Quote, Item) and then take the max element of each group, you are good to go. Note that the example code does not run against a database; it uses an in-memory list, but that should be rather easy to replace.
type Table =
    {
        Quote: int
        Sequence: decimal
        Item: string
    }
let createTableEntry (q, s, i) =
    {
        Quote = q
        Sequence = s
        Item = i
    }
let printTR {Quote=q; Sequence=s; Item=i} = printfn "%A | %A | %A" q s i
let table =
    [
        (1, 1.0M, "a")
        (1, 2.0M, "a")
        (1, 3.0M, "a")
        (1, 1.0M, "b")
        (1, 2.0M, "b")
        (1, 3.0M, "b")
        (2, 1.0M, "x")
        (2, 2.0M, "x")
        (3, 1.0M, "y")
    ]
    |> List.map createTableEntry
let result =
    table
    |> List.groupBy (fun x -> x.Quote, x.Item) // group "unique" by Quote & Item
    |> List.map (fun x -> snd x |> List.max)   // get max of each group, i.e. max of Sequence
result |> Seq.iter printTR
1 | 3.0M | "a"
1 | 3.0M | "b"
2 | 2.0M | "x"
3 | 1.0M | "y"
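For readers who do not use F#, the same "group, then take the max of each group" idea can be sketched in plain Python. This is only an illustration on the in-memory sample data; the names Row and max_sequence_rows are made up for this sketch and are not part of any library.

```python
from collections import namedtuple
from decimal import Decimal

# Hypothetical record type mirroring the Quote / Sequence / Item columns.
Row = namedtuple("Row", ["quote", "sequence", "item"])

table = [
    Row(1, Decimal("1.0"), "a"), Row(1, Decimal("2.0"), "a"), Row(1, Decimal("3.0"), "a"),
    Row(1, Decimal("1.0"), "b"), Row(1, Decimal("2.0"), "b"), Row(1, Decimal("3.0"), "b"),
    Row(2, Decimal("1.0"), "x"), Row(2, Decimal("2.0"), "x"),
    Row(3, Decimal("1.0"), "y"),
]

def max_sequence_rows(rows):
    """Keep, for each (quote, item) group, the row with the largest sequence."""
    best = {}
    for row in rows:
        key = (row.quote, row.item)
        # Replace the stored row only when this one has a larger sequence.
        if key not in best or row.sequence > best[key].sequence:
            best[key] = row
    return list(best.values())

result = max_sequence_rows(table)
for row in result:
    print(row.quote, "|", row.sequence, "|", row.item)
```

The dictionary plays the role of the grouping step, so the table is scanned only once.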
Addendum
After Ivan Stoev had answered partially "wrongly", here is a "corrected" version, which does the "same" (not really the same, but ...) as the above:
let quoteQuery =
    query {
        for row in table do
        groupBy (row.Quote, row.Item) into g
        let maxRow =
            query {
                for row in g do
                sortByDescending row.Sequence // descending, so the head is the max Sequence
                headOrDefault
            }
        select maxRow
    }
quoteQuery |> Seq.iter printTR
Addendum II
Since I edited and said that Ivan's answer was not really the same as the first code example, I have also added one that is "exactly the same" with query expressions:
let quoteQuery' =
    query {
        for row in table do
        groupBy (row.Quote, row.Item) into g
        let maxRow =
            query {
                for row in g do
                maxBy row.Sequence
            }
        select (fst g.Key, maxRow, snd g.Key)
    } |> Seq.map createTableEntry
quoteQuery' |> Seq.iter printTR

F# query expressions are called query expressions because they (loosely) follow LINQ query syntax (as opposed to method syntax), and they do implement Linq.IQueryable. So I also think of them as LINQ. While it may not be idiomatic, the above query can be rewritten in method syntax. If you add MoreLinq from NuGet, it's quite succinct. "Stealing" the above table definition from @HelgeReneUrholm:
open System.Linq
open MoreLinq
table
    .GroupBy(fun x -> (x.Quote, x.Item))
    .Select(fun x -> x.MaxBy(fun x -> x.Sequence)) |> Seq.iter printTR
1 | 3.0M | "a"
1 | 3.0M | "b"
2 | 2.0M | "x"
3 | 1.0M | "y"

Related

Question about the behavior of the fortran function MATMUL()

I am a newbie in Fortran and I have to multiply matrices of different shapes with MATMUL(), but the result is not what I expected.
Here is my fortran code:
integer, dimension(3,2) :: a
integer, dimension(2,2) :: b
integer :: i, j
a = reshape((/ 1, 1, 1, 1, 1, 1 /), shape(a))
b = MATMUL(a,TRANSPOSE(a))
do j = 1, 2
do i = 1, 2
print*, b(i, j)
end do
end do
I expected this matrix as a result:
b =
| 3 3 | , a 2x2 matrix
| 3 3 |
Instead, I got this error message:
matmlt.f90(9): error #6366: The shapes of the array expressions do not conform. [B]
b = MATMUL(a,TRANSPOSE(a))
------^
To make this code work properly I had to switch the MATMUL arguments like this:
b = MATMUL(TRANSPOSE(a), a)
This way, I obtain what I was expecting at the beginning. But this is not intuitive.
On paper,
a =
| 1 1 1 |
| 1 1 1 |
transpose(a) =
| 1 1 |
| 1 1 |
| 1 1 |
a x transpose(a) =
| 3 3 |
| 3 3 |
and
transpose(a) x a =
| 2 2 2 |
| 2 2 2 |
| 2 2 2 |
What is wrong with my code?
Thank you.
Your declaration of the variable
integer, dimension(3,2) :: a
means that a has 3 rows and 2 columns (the opposite of your assumption). Consequently
a =
| 1 1 |
| 1 1 |
| 1 1 |
and
transpose(a) =
| 1 1 1 |
| 1 1 1 |
so
matmul(a,transpose(a)) =
| 2 2 2 |
| 2 2 2 |
| 2 2 2 |
Your variable b should therefore be declared as
integer, dimension (3,3) :: b
instead of
integer, dimension (2,2) :: b
That shape mismatch is the reason for the error
matmlt.f90(9): error #6366: The shapes of the array expressions do not conform. [B]
b = MATMUL(a,TRANSPOSE(a))
------^
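The shape rule behind this can be checked outside Fortran as well. Below is a minimal pure-Python sketch (the helper names transpose, matmul and shape are written just for this illustration) showing that a 3x2 matrix times its transpose is 3x3, while the transpose times the matrix is 2x2.

```python
def transpose(m):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Plain matrix product; the inner dimensions must agree."""
    if len(x[0]) != len(y):
        raise ValueError("inner dimensions do not match")
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]

def shape(m):
    return (len(m), len(m[0]))

# 3 rows and 2 columns of ones, like "integer, dimension(3,2) :: a".
a = [[1, 1] for _ in range(3)]

print(shape(matmul(a, transpose(a))))  # (3, 3), every entry is 2
print(shape(matmul(transpose(a), a)))  # (2, 2), every entry is 3
```

This matches what was observed: only MATMUL(TRANSPOSE(a), a) fits into a 2x2 result.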

How do I get one row for every Min or Max on every column of a dataframe in Pyspark efficiently?

I'm trying to reduce a big dataset to rows having minimum and maximum values for each column. In other words, I would like, for every column of this dataset to get one row that has the minimum value on that column, as well as another that has the maximum value on the same column. I should mention that I do not know in advance what columns this dataset will have. Here's an example:
+----+----+----+ +----+----+----+
|Col1|Col2|Col3| ==> |Col1|Col2|Col3|
+----+----+----+ +----+----+----+
| F | 99 | 17 | | A | 34 | 25 |
| M | 32 | 20 | | Z | 51 | 49 |
| D | 2 | 84 | | D | 2 | 84 |
| H | 67 | 90 | | F | 99 | 17 |
| P | 54 | 75 | | C | 18 | 9 |
| C | 18 | 9 | | H | 67 | 90 |
| Z | 51 | 49 | +----+----+----+
| A | 34 | 25 |
+----+----+----+
The first row is selected because A is the smallest value on Col1. The second because Z is the largest value on Col1. The third because 2 is the smallest on Col2, and so on. The code below seems to do the right thing (correct me if I'm wrong), but performance is sloooow. I start with getting a dataframe from a random .csv file:
input_file = (sqlContext.read
.format("csv")
.options(header="true", inferSchema="true", delimiter=";", charset="UTF-8")
.load("/FileStore/tables/random.csv")
)
Then I create two other dataframes that each have one row with the min and respectively, max values of each column:
from pyspark.sql.functions import col, min, max
min_values = input_file.select(
*[min(col(col_name)).name(col_name) for col_name in input_file.columns]
)
max_values = input_file.select(
*[max(col(col_name)).name(col_name) for col_name in input_file.columns]
)
Finally, I repeatedly join the original input file to these two dataframes holding minimum and maximum values, using every column in turn, and do a union between all the results.
min_max_rows = (
    input_file
    .join(min_values, input_file[input_file.columns[0]] == min_values[input_file.columns[0]])
    .select(input_file["*"]).limit(1)
    .union(
        input_file
        .join(max_values, input_file[input_file.columns[0]] == max_values[input_file.columns[0]])
        .select(input_file["*"]).limit(1)
    )
)
for c in input_file.columns[1:]:
    min_max_rows = min_max_rows.union(
        input_file
        .join(min_values, input_file[c] == min_values[c])
        .select(input_file["*"]).limit(1)
        .union(
            input_file
            .join(max_values, input_file[c] == max_values[c])
            .select(input_file["*"]).limit(1)
        )
    )
# dropDuplicates returns a new dataframe, so keep the result
min_max_rows = min_max_rows.dropDuplicates()
For my test dataset of 500k rows, 40 columns, doing all this takes about 7-8 minutes on a standard Databricks cluster. I'm supposed to sift through more than 20 times this amount of data regularly. Is there any way to optimize this code? I'm quite afraid I've taken the naive approach to it, since I'm quite new to Spark.
Thanks!
This does not seem to be a popular question, but I found it interesting. It was also a lot of work for 15 points; in fact, I got it wrong the first time round.
Here is a scalable solution that you can partition accordingly to increase throughput.
It is hard to explain: manipulating and transposing the data is the key issue here, along with some lateral thinking.
I did not focus on variable columns with all sorts of data types. You will need to solve that yourself; it can be done, but it requires some if/else logic to check whether a column is alphabetic, double, or otherwise numeric. Mixing data types gets problematic, but it can be solved. I gave a notion of it with num_string, but did not complete that.
I have focused on the scalability issue and the approach, with less procedural logic. The sample is smaller and all numeric, but the result is correct as far as I can see. The general principle is there.
Try it. Success.
Code:
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Unpivot: turn every column not listed in `by` into (key, val) rows.
def reshape(df, by):
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])

df1 = spark.createDataFrame(
    [(4, 15, 3), (200, 100, 25), (7, 16, 4)], ("c1", "c2", "c3"))
df1 = df1.withColumn("rowId", monotonically_increasing_id())
df1.cache()
df1.show()
df2 = reshape(df1, ["rowId"])
df2.show()
# In case you have other types like characters in the other column - not focusing on that aspect
df3 = df2.withColumn("num_string", format_string("%09d", col("val")))
# Avoid column name issues.
df3 = df3.withColumn("key1", col("key"))
df3.show()
# Min and max per original column.
df3 = df3.groupby('key1').agg(min(col("val")).alias("min_val"), max(col("val")).alias("max_val"))
df3.show()
df4 = df3.join(df2, df3.key1 == df2.key)
new_column_condition = expr(
    """IF(val = min_val, -1, IF(val = max_val, 1, 0))"""
)
df4 = df4.withColumn("col_class", new_column_condition)
df4.show()
# Keep only the cells that hit a column min or max.
df5 = df4.filter('(min_val = val or max_val = val) and col_class <> 0')
df5.show()
# Join back to the original rows via rowId.
df6 = df5.join(df1, df5.rowId == df1.rowId)
df6.show()
df6.select([c for c in df6.columns if c in ['c1', 'c2', 'c3']]).distinct().show()
Returns:
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 4| 15| 3|
|200|100| 25|
+---+---+---+
Data wrangling is the key here.
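To make the steps easier to follow without a Spark cluster, here is a small pure-Python sketch of the same idea on the same three sample rows: unpivot to (row id, column, value), find the min and max per column, keep the row ids that hit one of them, and look the rows up again. The function name min_max_rows is made up for this sketch, and like the Spark code it keeps every row that ties for a min or max.

```python
rows = [(4, 15, 3), (200, 100, 25), (7, 16, 4)]
columns = ("c1", "c2", "c3")

def min_max_rows(rows, columns):
    # Step 1: unpivot into (row_id, key, val) triples, like reshape() above.
    long_form = [
        (row_id, key, val)
        for row_id, row in enumerate(rows)
        for key, val in zip(columns, row)
    ]
    # Step 2: min and max value per column name.
    extremes = {}
    for _, key, val in long_form:
        lo, hi = extremes.get(key, (val, val))
        extremes[key] = (min(lo, val), max(hi, val))
    # Step 3: row ids whose value equals the min or max of its column.
    keep = {row_id for row_id, key, val in long_form if val in extremes[key]}
    # Step 4: "join" back to the original rows, without duplicates.
    return [rows[i] for i in sorted(keep)]

print(min_max_rows(rows, columns))
```

On this sample only the first two rows survive, which agrees with the output shown above.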

Neo4j very slowly using shortestPath

I'm trying to improve the speed of the query below. It returns the data in 9 seconds. If I remove the shortestPath, the time drops to 1.5 seconds.
Does anyone know what might be wrong with my query, or how to optimize shortestPath?
It's a single query:
MATCH (currentUser:Packer {UUID:'19443'})-[:I_Follow*0..1]->followers-[rf:Has_Backpack|Has_Contribution*0..1]->(e)
Match (e)-[rp:Has_Pocket|Has_Document*0..]->d
Match d-[rn:Say_Thanks|I_Follow|I_Favorite_Follow|I_Favorite*0..1]->a
with distinct currentUser,followers, a, last(rf + rp + rn) as l
Optional match shortestPath(currentUser-[:Has_Group|Has_Shared_To_Collaboration|Hub_Shared|Has_Shared|Has_Backpack|Has_Pocket|Has_Document]->a)
with followers, a, count(a) as num,l
OPTIONAL MATCH a-[:Hub_Comments]->()-[rf:Has_Comment]->comments
WITH followers, a, l, collect(comments)[0..3] as coll,count(comments) as totalComments,num
MATCH parent-[l]->a where (num > 0 or a.Permission <> 'Private') with followers, a, parent, l, coll, totalComments order by l.Datecreate desc skip 0 limit 10
Match (owner:Packer {Username:a.Createdby})
return followers, a, parent, l, coll, totalComments, owner
Using the profile have this data:
Operator | Rows | DbHits | Identifiers
Extract (0) | 3731 | 7462 |
PatternMatcher (0) | 3731 | 8386 | parent, a, l |
Filter | 3735 | 7470 | | (a> {} AUTOINT3 OR NOT (Property (a, Permission (10)) == {AUTOSTRING4})) |
Total Accesses database: 23386
Version: 2.1.6
nodes: 175,563
properties: 468 402
relationships: 155,284
relationship types: 38
database disk: 780 MB
usage: 2 MB

Sample time series by time interval with Hive QL and calculate jumps

I have time series data in a table. Basically each row has a timestamp and a value.
The frequency of the data is absolutely random.
I'd like to sample it at a given frequency and, for each interval, extract relevant information: min, max, last, change (relative to the previous interval), return (change / previous) and maybe more (count...)
So here's my input:
08:00:10, 1
08:01:20, 2
08:01:21, 3
08:01:24, 5
08:02:24, 2
And I'd like to get the following result for 1 minute sampling (ts, min, max, last, change, return):
ts, m, M, L, Chg, Return
08:01:00, 1, 1, 1, NULL, NULL
08:02:00, 2, 5, 5, 4, 4
08:03:00, 2, 2, 2, -3, -0.25
You could do it with something like this (comments inline):
SELECT
min
, mn
, mx
, l
, l - LAG(l, 1) OVER (ORDER BY min) c
-- This might not be the right calculation. Unsure how -0.25 was derived in question.
, (l - LAG(l, 1) OVER (ORDER BY min)) / (LAG(l, 1) OVER (ORDER BY min)) r
FROM
(
SELECT
min
, MIN(val) mn
, MAX(val) mx
-- We can take MAX here because all l's (last values) for the minute are the same.
, MAX(l) l
FROM
(
SELECT
min
, val
-- The last value of the minute, ordered by the timestamp, using all rows.
, LAST_VALUE(val) OVER (PARTITION BY min ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) l
FROM
(
SELECT
ts
-- Drop the seconds and go forward one minute by converting to seconds,
-- adding 60, and then going back to a shorter string format.
-- 2000-01-01 is a dummy date just to enable the conversion.
, CONCAT(FROM_UNIXTIME(UNIX_TIMESTAMP(CONCAT("2000-01-01 ", ts), "yyyy-MM-dd HH:mm:ss") + 60, "HH:mm"), ":00") min
, val
FROM
-- As from the question.
21908430_input a
) val_by_min
) val_by_min_with_l
GROUP BY min
) min_with_l_m_M
ORDER BY min
;
Result:
+----------+----+----+---+------+------+
| min | mn | mx | l | c | r |
+----------+----+----+---+------+------+
| 08:01:00 | 1 | 1 | 1 | NULL | NULL |
| 08:02:00 | 2 | 5 | 5 | 4 | 4 |
| 08:03:00 | 2 | 2 | 2 | -3 | -0.6 |
+----------+----+----+---+------+------+
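The same bucketing logic can be sketched in plain Python to check the numbers by hand. This is only an illustration on the question's five samples: the helper names bucket_end and resample are made up, all timestamps are assumed to fall within one day, and the previous last value is assumed to be non-zero when the return is computed.

```python
from datetime import datetime, timedelta

samples = [
    ("08:00:10", 1),
    ("08:01:20", 2),
    ("08:01:21", 3),
    ("08:01:24", 5),
    ("08:02:24", 2),
]

def bucket_end(ts):
    """Label a timestamp with the end of its minute, e.g. 08:00:10 -> 08:01:00."""
    t = datetime.strptime(ts, "%H:%M:%S")
    return (t.replace(second=0) + timedelta(minutes=1)).strftime("%H:%M:%S")

def resample(samples):
    # Group values by minute label, in timestamp order.
    buckets = {}
    for ts, val in sorted(samples):
        buckets.setdefault(bucket_end(ts), []).append(val)
    out = []
    prev_last = None
    for label in sorted(buckets):
        vals = buckets[label]
        last = vals[-1]
        # Change and return are relative to the previous bucket's last value.
        change = None if prev_last is None else last - prev_last
        ret = None if prev_last is None else change / prev_last
        out.append((label, min(vals), max(vals), last, change, ret))
        prev_last = last
    return out

for row in resample(samples):
    print(row)
```

With return defined as change / previous last, the third bucket gives -3 / 5 = -0.6, the same value as in the Hive result.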

How to find the maximum number of matching data?

Given a two-dimensional array such as:
  | 1 | 2 | 3 | 4 | 5
--|---|---|---|---|---
1 | X | X | O | O | X
2 | O | O | O | X | X
3 | X | X | O | X | X
4 | X | X | O | X | X
I have to find the largest set of cells currently containing O with a maximum of one cell per row and one per column.
For instance, in the previous example, the optimal answer is 3, when:
row 1 goes with column 4;
row 2 goes with column 1 (or 2);
row 3 (or 4) goes with column 3.
It seems that I have to find an algorithm in O(CR) (where C is the number of columns and R the number of rows).
My first idea was to build, for each column, the list of rows it is compatible with, sort these lists in ascending order of size (fewest O cells first), and then greedily take the first unused row from each list. Here is what the algorithm would look like:
For i From 0 To R - 1
    For j From 0 To C - 1
        If compatible(i, j)
            add(a[j], i)
Sort a according to a[j].size
result = 0
For i From 0 To C - 1
    For j From 0 To a[i].size - 1
        If used[a[i][j]] = false
            used[a[i][j]] = true
            result = result + 1
            break
Print result
Although I didn't find any counterexample, I don't know whether it always gives the optimal answer.
Is this algorithm correct? Is there any better solution?
Going off Billiska's suggestion, I found a nice implementation of the "Hopcroft-Karp" algorithm in Python here:
http://code.activestate.com/recipes/123641-hopcroft-karp-bipartite-matching/
This algorithm is one of several that solve the maximum bipartite matching problem. Using that code exactly "as-is", here's how I solved the example problem in your post (in Python):
from collections import defaultdict
X = 0; O = 1
patterns = [[X, X, O, O, X],
            [O, O, O, X, X],
            [X, X, O, X, X],
            [X, X, O, X, X]]
# Build the bipartite graph: each row is linked to the columns where it has an O.
G = defaultdict(list)
for i, x in enumerate(patterns):
    for j, y in enumerate(x):  # iterate over the cells of row i
        if y:
            G['Row ' + str(i)].append('Col ' + str(j))
solution = bipartiteMatch(G)  ### function defined in provided link
print len(solution[0]), solution[0]
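If pulling in the linked recipe is not an option, a self-contained alternative is the simpler augmenting-path method (often called Kuhn's algorithm). Note that this is a different technique from Hopcroft-Karp, and it is not O(CR) in the worst case; the sketch below is a minimal version for small grids, and the names max_matching and try_assign are made up for it.

```python
X, O = 0, 1
patterns = [[X, X, O, O, X],
            [O, O, O, X, X],
            [X, X, O, X, X],
            [X, X, O, X, X]]

def max_matching(grid):
    """Return (size, {row: column}) of a maximum row/column matching on O cells."""
    n_cols = len(grid[0])
    col_owner = [None] * n_cols  # col_owner[j] = row currently matched to column j

    def try_assign(i, seen):
        # Try to give row i a column, re-routing earlier rows if needed.
        for j in range(n_cols):
            if grid[i][j] and j not in seen:
                seen.add(j)
                if col_owner[j] is None or try_assign(col_owner[j], seen):
                    col_owner[j] = i
                    return True
        return False

    # Each successful augmentation grows the matching by one.
    size = sum(try_assign(i, set()) for i in range(len(grid)))
    matching = {owner: j for j, owner in enumerate(col_owner) if owner is not None}
    return size, matching

size, matching = max_matching(patterns)
print(size, matching)
```

For the example grid this finds a matching of size 3, in line with the optimal answer stated in the question (indices here are zero-based).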

Resources