Related
I'm trying to reduce a big dataset to rows having minimum and maximum values for each column. In other words, I would like, for every column of this dataset to get one row that has the minimum value on that column, as well as another that has the maximum value on the same column. I should mention that I do not know in advance what columns this dataset will have. Here's an example:
+----+----+----+ +----+----+----+
|Col1|Col2|Col3| ==> |Col1|Col2|Col3|
+----+----+----+ +----+----+----+
| F | 99 | 17 | | A | 34 | 25 |
| M | 32 | 20 | | Z | 51 | 49 |
| D | 2 | 84 | | D | 2 | 84 |
| H | 67 | 90 | | F | 99 | 17 |
| P | 54 | 75 | | C | 18 | 9 |
| C | 18 | 9 | | H | 67 | 90 |
| Z | 51 | 49 | +----+----+----+
| A | 34 | 25 |
+----+----+----+
The first row is selected because A is the smallest value on Col1. The second because Z is the largest value on Col1. The third because 2 is the smallest on Col2, and so on. The code below seems to do the right thing (correct me if I'm wrong), but performance is sloooow. I start with getting a dataframe from a random .csv file:
input_file = (sqlContext.read
.format("csv")
.options(header="true", inferSchema="true", delimiter=";", charset="UTF-8")
.load("/FileStore/tables/random.csv")
)
Then I create two other dataframes that each have one row with the min and respectively, max values of each column:
from pyspark.sql.functions import col, min, max
min_values = input_file.select(
*[min(col(col_name)).name(col_name) for col_name in input_file.columns]
)
max_values = input_file.select(
*[max(col(col_name)).name(col_name) for col_name in input_file.columns]
)
Finally, I repeatedly join the original input file to these two dataframes holding minimum and maximum values, using every column in turn, and do a union between all the results.
min_max_rows = (
input_file
.join(min_values, input_file[input_file.columns[0]] == min_values[input_file.columns[0]])
.select(input_file["*"]).limit(1)
.union(
input_file
.join(max_values, input_file[input_file.columns[0]] == max_values[input_file.columns[0]])
.select(input_file["*"]).limit(1)
)
)
for c in input_file.columns[1:]:
min_max_rows = min_max_rows.union(
input_file
.join(min_values, input_file[c] == min_values[c])
.select(input_file["*"]).limit(1)
.union(
input_file
.join(max_values, input_file[c] == max_values[c])
.select(input_file["*"]).limit(1)
)
)
min_max_rows.dropDuplicates()
For my test dataset of 500k rows, 40 columns, doing all this takes about 7-8 minutes on a standard Databricks cluster. I'm supposed to sift through more than 20 times this amount of data regularly. Is there any way to optimize this code? I'm quite afraid I've taken the naive approach to it, since I'm quite new to Spark.
Thanks!
Does not seem to be a popular question, but interesting (for me). And a lot of work for 15 pts. In fact I got it wrong first time round.
Here is a scaleable solution that you can partition accordingly to increase throughput.
Hard to explain, manipulation of the data and transposing the data is
the key issue here - and some lateral thinking.
I did not focus on variable columns all sorts of data types. That needs to be solved by yourself, can be done but some if else logic required to check if alpha or double or numeric. Mixing data types and applying to stuff gets problematic, but can be solved. I gave a notion of num_string, but did not complete that.
I have focused on the scalability issue and approach, with less procedural logic. Smaller sample size with all numbers, but correct as now as far as I can see. General principle is there.
Try it. Success.
Code:
from pyspark.sql.functions import *
from pyspark.sql.types import *
def reshape(df, by):
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
kvs = explode(array([
struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
])).alias("kvs")
return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
df1 = spark.createDataFrame(
[(4, 15, 3), (200, 100, 25), (7, 16, 4)], ("c1", "c2", "c3"))
df1 = df1.withColumn("rowId", monotonically_increasing_id())
df1.cache
df1.show()
df2 = reshape(df1, ["rowId"])
df2.show()
# In case you have other types like characters in the other column - not focusing on that aspect
df3 = df2.withColumn("num_string", format_string("%09d", col("val")))
# Avoid column name issues.
df3 = df3.withColumn("key1", col("key"))
df3.show()
df3 = df3.groupby('key1').agg(min(col("val")).alias("min_val"), max(col("val")).alias("max_val"))
df3.show()
df4 = df3.join(df2, df3.key1 == df2.key)
new_column_condition = expr(
"""IF(val = min_val, -1, IF(val = max_val, 1, 0))"""
)
df4 = df4.withColumn("col_class", new_column_condition)
df4.show()
df5 = df4.filter( '(min_val = val or max_val = val) and col_class <> 0' )
df5.show()
df6 = df5.join(df1, df5.rowId == df1.rowId)
df6.show()
df6.select([c for c in df6.columns if c in ['c1','c2', 'c3']]).distinct().show()
Returns:
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 4| 15| 3|
|200|100| 25|
+---+---+---+
Data wrangling the clue here.
I need a function that given an input of this style:
printMatrix(N, M)
where N is an integer and M a list of integers:
printMatrix(3, [1,5,8,9 ...]).
Where 3 is the number of rows and columns of a board.
The successive integers mark the box number (1 would be equivalent to position 1 (row), 1 (column) of the matrix, 2 the 1,2, 3 the 1,3, 4 the 2,1, 5 the 2 , 2, 6 the 2.3, 7 the 3.1, 8 the 3.2 and 9 the 3.3). You have to paint an 'X' for each number that appears in the list.
The output sought in this example would be the following:
-------
| X | | |
-------
| | X | |
-------
| | X | X |
I don't even know how to start, any help is welcomed.
You can compute the index of each box easily. Then iterate over every row and every column and check whether its corresponding index is in the input list, printing the required value along the way (either a 'X' or a ' ', also interleaving the box's walls):
printMatrix(N, M):-
Width is 2*N+1,
format('~`-t~*|', [Width]),
forall(between(1,N,Row),
(
SBase is N*(Row-1)+1,
EBase is N*Row,
nl,
write('|'),
forall(between(SBase,EBase,Item),
(
(memberchk(Item, M)->write('X');write(' ')),
write('|')
))
)),
nl,
format('~`-t~*|', [Width]).
Here I use forall/2 and between/3 predicates to iterate over rows and columns, and memberchk/2 to see if the item is in the list.
Sample output:
?- printMatrix(3,[1,4,5,9]).
-------
|X| | |
|X|X| |
| | |X|
-------
You can start from something like from this basic solution where the matrix has elements from 0 to 8 in case of 3x3:
test:-
% -------
% | 0 | 1 | 2 |
% | 3 | 4 | 5 |
% | 6 | 7 | 8 |
N = 3,
L = [1,5,7],
NN is N*N,
write('|'),
loop(0,NN,N,L).
check_newline(I,_,I):- !.
check_newline(V,Mod,Max):-
V < Max, !,
( 0 =:= mod(V,Mod) -> nl,write('|'); true).
loop(H,H,_,_):- !.
loop(I,Max,N,[]):-
I < Max, !,
write('_|'),
I1 is I+1,
check_newline(I1,N,Max),
loop(I1,Max,N,[]).
loop(H,Max,N,[H|T]):-
H < Max, !,
write('X|'),
H1 is H+1,
check_newline(H1,N,Max),
loop(H1,Max,N,T).
loop(H,Max,N,[E|T]):-
H < Max, !,
H \= E,
write('_|'),
H1 is H+1,
check_newline(H1,N,Max),
loop(H1,Max,N,[E|T]).
?- test.
|_|X|_|
|_|_|X|
|_|X|_|
true
Then you can complicate the code and do all sort of fancy stuff and maybe write shorter code.
I'd like to query a database table that looks like the simplified example below:
Quote | Sequence | Item
-------|-----------|-----
1 | 1.0M | a
1 | 2.0M | a
1 | 3.0M | a
1 | 1.0M | b
1 | 2.0M | b
1 | 3.0M | b
2 | 1.0M | x
2 | 2.0M | x
3 | 1.0M | y
and I need a query that gets all rows for a given Quote where the Sequence is the max value for that column:
Quote | Sequence | Item
-------|-----------|-----
1 | 3.0M | a
1 | 3.0M | b
2 | 2.0M | x
3 | 1.0M | y
I'm using F# and System.Data.Linq.
I can use
let quoteQuery =
query{
for row in db.[TABLE] do
select row
}
to get all rows, but I don't know Linq well enough--yet--to modify this to have the query that will produce the desired results. I've tried using the answer from this question in an attempt to modify my query, but I've hit a wall in trying to modify (guess?) the syntax/language necessary.
There are several SQL examples I can find, but few that are Linq-specific.
As are hinted in comments, this is not Linq, but f# query expressions.
And that is in fact not really what this question is about after all.
Its more set and relational algebra. Or something...
That said: The thing here is that if you group by and then get the max element of each group then you are good to good. Mind that the example code does not work against any DB or otherwise, but that should be rather easily replaceable.
type Table =
{
Quote:int
Sequence: decimal
Item: string
}
let createTableEntry (q,s,i) =
{
Quote = q
Sequence = s
Item = i
}
let printTR {Quote=q;Sequence=s;Item=i} = printfn "%A | %A | %A" q s i
let table =
[
(1 , 1.0M , "a")
(1 , 2.0M , "a")
(1 , 3.0M , "a")
(1 , 1.0M , "b")
(1 , 2.0M , "b")
(1 , 3.0M , "b")
(2 , 1.0M , "x")
(2 , 2.0M , "x")
(3 , 1.0M , "y")
]
|> List.map createTableEntry
let result =
table
|> List.groupBy (fun x -> x.Quote, x.Item) //group "unique" by Quote&Item
|> List.map (fun x -> snd x |> List.max) //get max of each group, i.e. max of Sequence
result |> Seq.iter printTR
1 | 3.0M | "a"
1 | 3.0M | "b"
2 | 2.0M | "x"
3 | 1.0M | "y"
Addendum
after Ivan Stoev had answered partially "wrongly". Here is a "corrected" version, which does the "same" (not really same, but ...) as the above:
let quoteQuery =
query {
for row in table do
groupBy (row.Quote, row.Item) into g
let maxRow =
query {
for row in g do
sortBy row.Sequence
headOrDefault
}
select maxRow
}
quoteQuery |> Seq.iter printTR
Addendum II
Since I edited and said that Ivans answer was not really the same as the first code example, I have also added one that is "exactly the same" with query expressions:
let quoteQuery' =
query {
for row in table do
groupBy (row.Quote, row.Item) into g
let maxRow =
query {
for row in g do
maxBy row.Sequence
}
select (fst g.Key, maxRow, snd g.Key)
}|>Seq.map createTableEntry
quoteQuery' |> Seq.iter printTR
F# Query expression are called query expression because they follow (losely) LINQ query syntax (vs method syntax) and they do implement Linq.IQueriable. So I also think about them as LINQ. While it maybe not idiomatic the above query can be rewritten in method syntax. If you nuget MoreLinq, it's quite succint. "Stealing" the above table definition from #HelgeReneUrholm:
open MoreLinq
table
.GroupBy(fun x -> (x.Quote,x.Item))
.Select(fun x -> x.MaxBy(fun x -> x.Sequence)) |> Seq.iter printTR
1 | 3.0M | "a"
1 | 3.0M | "b"
2 | 2.0M | "x"
3 | 1.0M | "y"
I have time series data in a table. Basically each row has a timestamp and a value.
The frequency of the data is absolutely random.
I'd like to sample it with a given frequency and for each frequency extract relevant information about it: min, max, last, change (relative previous), return (change / previous) and maybe more (count...)
So here's my input:
08:00:10, 1
08:01:20, 2
08:01:21, 3
08:01:24, 5
08:02:24, 2
And I'd like to get the following result for 1 minute sampling (ts, min, max, last, change, return):
ts m M L Chg Return
08:01:00, 1, 1, 1, NULL, NULL
08:02:00, 2, 5, 5, 4, 4
08:03:00, 2, 2, 2, -3, -0.25
You could do it with something like this (comments inline):
SELECT
min
, mn
, mx
, l
, l - LAG(l, 1) OVER (ORDER BY min) c
-- This might not be the right calculation. Unsure how -0.25 was derived in question.
, (l - LAG(l, 1) OVER (ORDER BY min)) / (LAG(l, 1) OVER (ORDER BY min)) r
FROM
(
SELECT
min
, MIN(val) mn
, MAX(val) mx
-- We can take MAX here because all l's (last values) for the minute are the same.
, MAX(l) l
FROM
(
SELECT
min
, val
-- The last value of the minute, ordered by the timestamp, using all rows.
, LAST_VALUE(val) OVER (PARTITION BY min ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) l
FROM
(
SELECT
ts
-- Drop the seconds and go back one minute by converting to seconds,
-- subtracting 60, and then going back to a shorter string format.
-- 2000-01-01 is a dummy date just to enable the conversion.
, CONCAT(FROM_UNIXTIME(UNIX_TIMESTAMP(CONCAT("2000-01-01 ", ts), "yyyy-MM-dd HH:mm:ss") + 60, "HH:mm"), ":00") min
, val
FROM
-- As from the question.
21908430_input a
) val_by_min
) val_by_min_with_l
GROUP BY min
) min_with_l_m_M
ORDER BY min
;
Result:
+----------+----+----+---+------+------+
| min | mn | mx | l | c | r |
+----------+----+----+---+------+------+
| 08:01:00 | 1 | 1 | 1 | NULL | NULL |
| 08:02:00 | 2 | 5 | 5 | 4 | 4 |
| 08:03:00 | 2 | 2 | 2 | -3 | -0.6 |
+----------+----+----+---+------+------+
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 11 years ago.
Improve this question
You are given a set of blocks to build a panel using 3”×1” and 4.5”×1" blocks.
For structural integrity, the spaces between the blocks must not line up in adjacent rows.
There are 2 ways in which to build a 7.5”×1” panel, 2 ways to build a 7.5”×2” panel, 4 ways to build a 12”×3” panel, and 7958 ways to build a 27”×5” panel. How many different ways are there to build a 48”×10” panel?
This is what I understand so far:
with the blocks 3 x 1 and 4.5 x 1
I've used combination formula to find all possible combinations that the 2 blocks can be arranged in a panel of this size
C = choose --> C(n, k) = n!/r!(n-r)! combination of group n at r at a time
Panel: 7.5 x 1 = 2 ways -->
1 (3 x 1 block) and 1 (4.5 x 1 block) --> Only 2 blocks are used--> 2 C 1 = 2 ways
Panel: 7.5 x 2 = 2 ways
I used combination here as well
1(3 x 1 block) and 1 (4.5 x 1 block) --> 2 C 1 = 2 ways
Panel: 12 x 3 panel = 2 ways -->
2(4.5 x 1 block) and 1(3 x 1 block) --> 3 C 1 = 3 ways
0(4.5 x 1 block) and 4(3 x 1 block) --> 4 C 0 = 1 way
3 ways + 1 way = 4 ways
(This is where I get confused)
Panel 27 x 5 panel = 7958 ways
6(4.5 x 1 block) and 0(3 x 1) --> 6 C 0 = 1 way
4(4.5 x 1 block) and 3(3 x 1 block) --> 7 C 3 = 35 ways
2(4.5 x 1 block) and 6(3 x 1 block) --> 8 C 2 = 28 ways
0(4.5 x 1 block) and 9(3 x 1 block) --> 9 C 0 = 1 way
1 way + 35 ways + 28 ways + 1 way = 65 ways
As you can see here the number of ways is nowhere near 7958. What am I doing wrong here?
Also how would I find how many ways there are to construct a 48 x 10 panel?
Because it's a little difficult to do it by hand especially when trying to find 7958 ways.
How would write a program to calculate an answer for the number of ways for a 7958 panel?
Would it be easier to construct a program to calculate the result? Any help would be greatly appreciated.
I don't think the "choose" function is directly applicable, given your "the spaces between the blocks must not line up in adjacent rows" requirement. I also think this is where your analysis starts breaking down:
Panel: 12 x 3 panel = 2 ways -->
2(4.5 x 1 block) and 1(3 x 1 block)
--> 3 C 1 = 3 ways
0(4.5 x 1 block) and 4(3 x 1 block)
--> 4 C 0 = 1 way
3 ways + 1 way = 4 ways
...let's build some panels (1 | = 1 row, 2 -'s = 1 column):
+---------------------------+
| | | | |
| | | | |
| | | | |
+---------------------------+
+---------------------------+
| | | |
| | | |
| | | |
+---------------------------+
+---------------------------+
| | | |
| | | |
| | | |
+---------------------------+
+---------------------------+
| | | |
| | | |
| | | |
+---------------------------+
Here we see that there are 4 different basic row types, but none of these are valid panels (they all violate the "blocks must not line up" rule). But we can use these row types to create several panels:
+---------------------------+
| | | | |
| | | | |
| | | |
+---------------------------+
+---------------------------+
| | | | |
| | | | |
| | | |
+---------------------------+
+---------------------------+
| | | | |
| | | | |
| | | |
+---------------------------+
+---------------------------+
| | | |
| | | |
| | | | |
+---------------------------+
+---------------------------+
| | | |
| | | |
| | | |
+---------------------------+
+---------------------------+
| | | |
| | | |
| | | |
+---------------------------+
...
But again, none of these are valid. The valid 12x3 panels are:
+---------------------------+
| | | | |
| | | |
| | | | |
+---------------------------+
+---------------------------+
| | | |
| | | | |
| | | |
+---------------------------+
+---------------------------+
| | | |
| | | |
| | | |
+---------------------------+
+---------------------------+
| | | |
| | | |
| | | |
+---------------------------+
So there are in fact 4 of them, but in this case it's just a coincidence that it matches up with what you got using the "choose" function. In terms of total panel configurations, there are quite more than 4.
Find all ways to form a single row of the given width. I call this a "row type". Example 12x3: There are 4 row types of width 12: (3 3 3 3), (4.5 4.5 3), (4.5 3 4.5), (3 4.5 4.5). I would represent these as a list of the gaps. Example: (3 6 9), (4.5 9), (4.5 7.5), (3 7.5).
For each of these row types, find which other row types could fit on top of it.
Example:
a. On (3 6 9) fits (4.5 7.5).
b. On (4.5 9) fits (3 7.5).
c: On (4.5 7.5) fits (3 6 9).
d: On (3 7.5) fits (4.5 9).
Enumerate the ways to build stacks of the given height from these rules. Dynamic programming is applicable to this, as at each level, you only need the last row type and the number of ways to get there.
Edit: I just tried this out on my coffee break, and it works. The solution for 48x10 has 15 decimal digits, by the way.
Edit: Here is more detail of the dynamic programming part:
Your rules from step 2 translate to an array of possible neighbours. Each element of the array corresponds to a row type, and holds that row type's possible neighbouring row types' indices.
0: (2)
1: (3)
2: (0)
3: (1)
In the case of 12×3, each row type has only a single possible neighbouring row type, but in general, it can be more.
The dynamic programming starts with a single row, where each row type has exactly one way of appearing:
1 1 1 1
Then, the next row is formed by adding for each row type the number of ways that possible neighbours could have formed on the previous row. In the case of a width of 12, the result is 1 1 1 1 again. At the end, just sum up the last row.
Complexity:
Finding the row types corresponds to enumerating the leaves of a tree; there are about (/ width 3) levels in this tree, so this takes a time of O(2w/3) = O(2w).
Checking whether two row types fit takes time proportional to their length, O(w/3). Building the cross table is proportional to the square of the number of row types. This makes step 2 O(w/3·22w/3) = O(2w).
The dynamic programming takes height times the number of row types times the average number of neighbours (which I estimate to be logarithmic to the number of row types), O(h·2w/3·w/3) = O(2w).
As you see, this is all dominated by the number of row types, which grow exponentially with the width. Fortunately, the constant factors are rather low, so that 48×10 can be solved in a few seconds.
This looks like the type of problem you could solve recursively. Here's a brief outline of an algorithm you could use, with a recursive method that accepts the previous layer and the number of remaining layers as arguments:
Start with the initial number of layers (e.g. 27x5 starts with remainingLayers = 5) and an empty previous layer
Test all possible layouts of the current layer
Try adding a 3x1 in the next available slot in the layer we are building. Check that (a) it doesn't go past the target width (e.g. doesn't go past 27 width in a 27x5) and (b) it doesn't violate the spacing condition given the previous layer
Keep trying to add 3x1s to the current layer until we have built a valid layer that is exactly (e.g.) 27 units wide
If we cannot use a 3x1 in the current slot, remove it and replace with a 4.5x1
Once we have a valid layer, decrement remainingLayers and pass it back into our recursive algorithm along with the layer we have just constructed
Once we reach remainingLayers = 0, we have constructed a valid panel, so increment our counter
The idea is that we build all possible combinations of valid layers. Once we have (in the 27x5 example) 5 valid layers on top of each other, we have constructed a complete valid panel. So the algorithm should find (and thus count) every possible valid panel exactly once.
This is a '2d bin packing' problem. Someone with decent mathematical knowledge will be able to help or you could try a book on computational algorithms. It is known as a "combinatorial NP-hard problem". I don't know what that means but the "hard" part grabs my attention :)
I have had a look at steel cutting prgrams and they mostly use a best guess. In this case though 2 x 4.5" stacked vertically can accommodate 3 x 3" inch stacked horizontally. You could possibly get away with no waste. Gets rather tricky when you have to figure out the best solution --- the one with minimal waste.
Here's a solution in Java, some of the array length checking etc is a little messy but I'm sure you can refine it pretty easily.
In any case, I hope this helps demonstrate how the algorithm works :-)
import java.util.Arrays;
public class Puzzle
{
// Initial solve call
public static int solve(int width, int height)
{
// Double the widths so we can use integers (6x1 and 9x1)
int[] prev = {-1}; // Make sure we don't get any collisions on the first layer
return solve(prev, new int[0], width * 2, height);
}
// Build the current layer recursively given the previous layer and the current layer
private static int solve(int[] prev, int[] current, int width, int remaining)
{
// Check whether we have a valid frame
if(remaining == 0)
return 1;
if(current.length > 0)
{
// Check for overflows
if(current[current.length - 1] > width)
return 0;
// Check for aligned gaps
for(int i = 0; i < prev.length; i++)
if(prev[i] < width)
if(current[current.length - 1] == prev[i])
return 0;
// If we have a complete valid layer
if(current[current.length - 1] == width)
return solve(current, new int[0], width, remaining - 1);
}
// Try adding a 6x1
int total = 0;
int[] newCurrent = Arrays.copyOf(current, current.length + 1);
if(current.length > 0)
newCurrent[newCurrent.length - 1] = current[current.length - 1] + 6;
else
newCurrent[0] = 6;
total += solve(prev, newCurrent, width, remaining);
// Try adding a 9x1
if(current.length > 0)
newCurrent[newCurrent.length - 1] = current[current.length - 1] + 9;
else
newCurrent[0] = 9;
total += solve(prev, newCurrent, width, remaining);
return total;
}
// Main method
public static void main(String[] args)
{
// e.g. 27x5, outputs 7958
System.out.println(Puzzle.solve(27, 5));
}
}