Is there an efficient method similar to SQL window functions in PySpark? - performance

I am dealing with a huge DataFrame containing 3 columns and 5 billion rows. The size of the data is 360 GB. For the analysis of the data I am using the following setup:
- Jupyter notebooks running on an AWS r4.16xlarge
- PySpark kernel
The table called customer_sales looks similar to the following example:
+------------------+----------+-------+
| business_unit_id | customer | sales |
+------------------+----------+-------+
| 1                | a        | 5000  |
| 1                | b        | 2000  |
| 1                | c        | 3000  |
| 1                | d        | 5000  |
| 2                | f        | 600   |
| 2                | c        | 7000  |
| 3                | j        | 200   |
| 3                | k        | 800   |
| 3                | c        | 4500  |
+------------------+----------+-------+
Now I want to get, for each business_unit_id, the customer with the highest sales. If there is a tie in sales between several customers I want to get them all. The information should be stored in a table called best_customers_for_each_unit. For the example illustrated above, the table best_customers_for_each_unit looks as follows:
+------------------+----------+-------+
| business_unit_id | customer | sales |
+------------------+----------+-------+
| 1                | a        | 5000  |
| 1                | d        | 5000  |
| 2                | c        | 7000  |
| 3                | c        | 4500  |
+------------------+----------+-------+
In a second step I want to count how often each customer is the one with the highest sales in a specific business_unit_id. The output of this query will be:
+----------+-------+
| customer | count |
+----------+-------+
| a        | 1     |
| d        | 1     |
| c        | 2     |
+----------+-------+
For the first query I used spark.sql with window functions. The query looks as follows:
best_customers_for_each_unit = spark.sql("""
SELECT
    business_unit_id,
    customer,
    sales
FROM (
    SELECT
        business_unit_id,
        customer,
        sales,
        dense_rank() OVER (PARTITION BY business_unit_id ORDER BY sales DESC) AS rank
    FROM customer_sales) tmp
WHERE
    rank = 1
""")
For the second query I used the following PySpark snippet:
best_customers_for_each_unit.groupBy("customer").count()
My queries do actually work, but it takes ages to process even a minor part of the data. So do you know an efficient way to do such queries with PySpark?
Regards
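A commonly used alternative in Spark (a sketch only, not from the original post; it assumes the per-partition sort in the window is the bottleneck and that the table is available as the DataFrame customer_sales_df) is to aggregate the per-unit maximum and join it back, which keeps ties and avoids ranking every row:

from pyspark.sql import functions as F

# Maximum sales per business unit; aliasing to "sales" lets us join on equality.
max_sales = customer_sales_df.groupBy("business_unit_id").agg(
    F.max("sales").alias("sales")
)

# The inner join keeps exactly the rows whose sales equal the per-unit maximum,
# so tied customers are all retained.
best_customers_for_each_unit = customer_sales_df.join(
    max_sales, on=["business_unit_id", "sales"], how="inner"
)

# The second step is unchanged:
best_customers_for_each_unit.groupBy("customer").count()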

Related

How to simplify material implication?

You know that the truth table for material implication is:
A | C | Y = A --> C
0 | 0 | 1
0 | 1 | 1
1 | 0 | 0
1 | 1 | 1
From this table we can deduce
A --> C = Y = ~A~C + ~AC + AC (where ~X stands for NOT X)
But it is also well known that
A --> C = ~(A~C)
I can't reduce the 1st expression (~A~C + ~AC + AC) to the 2nd (~(A~C)). Can you show me the steps by which the 2nd can be obtained from the 1st?
Thank you.
(~A~C + ~AC + AC)
(~A~C + ~AC) + AC      (associativity)
~A(~C + C) + AC        (distribution)
~A(T) + AC             (complement: ~C + C = T)
~A + AC                (identity: X·T = X)
~~(~A + AC)            (double negation)
~((~~A)~(AC))          (De Morgan)
~(A~(AC))              (double negation)
~(A(~A + ~C))          (De Morgan on ~(AC))
~(A~A + A~C)           (distribution)
~(F + A~C)             (complement: A~A = F)
~(A~C)                 (identity: F + X = X)
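As a quick sanity check, the equivalence can also be verified by brute force over all four truth assignments; a small Python sketch (not part of the original exchange):

from itertools import product

# Verify ~A~C + ~AC + AC == ~(A~C) on every assignment of A and C.
for a, c in product([False, True], repeat=2):
    lhs = (not a and not c) or (not a and c) or (a and c)
    rhs = not (a and not c)
    assert lhs == rhs, (a, c)
print("The two expressions agree on all four assignments.")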

How can I group results into multiple condition subgroups with Eloquent in Laravel?

I have a product table and I am trying to write a query with Eloquent in which I group the results into five subgroups depending on conditions.
Each subgroup will show the total number of rows that meet its condition.
DB::table(Product::TABLE_NAME)->select(
    DB::raw("prize"),
    DB::raw("count(*) as total")
)
->groupBy('prize')->get();
I don't know where to put the conditions.
I expect this output:
id_group | total |
------------------
1 | 34 |
2 | 46 |
3 | 126 |
id_group represents each condition:
1 - prize = null
2 - prize = 0
3 - prize > 0
Solution:
I've got this, but I don't know if it is the most efficient way or if another way exists to do it (note the numeric aliases need to be quoted with backticks to be valid SQL):
DB::table(Product::TABLE_NAME)->select(
    DB::raw("SUM(CASE
        WHEN prize IS NULL THEN 1 ELSE 0 END) AS `1`"),
    DB::raw("SUM(CASE
        WHEN prize = 0 THEN 1 ELSE 0 END) AS `2`"),
    DB::raw("SUM(CASE
        WHEN prize > 0 THEN 1 ELSE 0 END) AS `3`")
)->get();

Oracle APEX cascading numeric values

I have the following coefficient table for exams.
val_disp | val_ret
%0 | 0
%10 | 0.1
%20 | 0.2
%30 | 0.3
%40 | 0.4
%50 | 0.5
%60 | 0.6
%70 | 0.7
%80 | 0.8
%90 | 0.9
%100 | 1
And I list these values as LOVs in select lists like this:
https://imgur.com/qslkXzt
Is it possible to cascade these values so that they sum up to %100? For example, if I choose %30 for the midterm and %40 for the final, can the maximum available value for the assignment be capped at %30 by summing up the two other item values?
And how can I use the val_ret of the selected list item in a SQL query?
Or am I going about this in a really, really wrong way?
Set the "Cascading LOV parent item(s)" property to the previous LOVs.
In each LOV query, reference the previously set values.
For example:
LOV_1:
select '%' || 10 * (level - 1) val_disp,
(level - 1) / 10 val_ret
from dual
connect by level <= 11;
LOV_2: cascading LOV is LOV_1
select '%' || 10 * (level - 1) val_disp,
(level - 1) / 10 val_ret
from dual
connect by level <= (1 - :LOV_1) * 10 + 1;
LOV_3: cascading LOVs are LOV_1 and LOV_2
select '%' || 10 * (level - 1) val_disp,
(level - 1) / 10 val_ret
from dual
connect by level <= (1 - (:LOV_1 + :LOV_2)) * 10 + 1;

Determine the number of rows spanned given a start index and a length

I have the start index of my data in a matrix and the number of elements of my data, and I need to find the number of rows that the data spans.
e.g. the matrix
0 5
-------------------
| | | | |x |x |
-------------------
|x |x |x |x |x |x |
-------------------
| | | | | | |
-------------------
| | | | | | |
-------------------
My data is marked with x. I know the start index, 4, and the length of the data, 8.
And I need to determine the number of rows this data spans, 2 in this case. (Just doing length/6 is off by one in many cases; surely there has to be a simple formula for this.)
If you only know offset (i.e. the index of the starting column), size (i.e. how many data elements), and cols (i.e. the maximum number of columns), and you want to calculate how many rows your data will span, you can do
int get_spanned_rows(int offset, int size, int cols) {
    int spanned_rows = (offset + size) / cols;
    if (((offset + size) % cols) != 0)
        spanned_rows++;
    return spanned_rows;
}
where % is the modulus (or remainder) operator.
int calc_rows(int start, int length) {
    int rows = 0;
    int x = 0;                        // cells in the partially filled rows
    if (start != 0) {
        x = 6 - start;                // cells in the partial first row
        rows += 1;
    }
    if ((start + length) % 6 != 0) {
        x += (start + length) % 6;    // cells in the partial last row
        rows += 1;
    }
    rows += (length - x) / 6;         // remaining full rows
    return rows;
}
This calculates the number of rows by dividing by 6, after first subtracting the cell counts of the partially filled rows.
In case you only want the number of rows, you can calculate
num_rows = (offset + size + cols - 1) / cols
(using integer division), which in this case is
num_rows = (4 + 8 + 6 - 1) / 6 = 17 / 6 = 2
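The same ceiling division, sketched in Python as a quick check (the helper name is mine):

# Number of rows spanned, assuming a 0-based column offset and a fixed
# number of columns per row. // is Python's integer (floor) division.
def spanned_rows(offset: int, size: int, cols: int) -> int:
    return (offset + size + cols - 1) // cols

assert spanned_rows(4, 8, 6) == 2   # the example from the question
assert spanned_rows(0, 6, 6) == 1   # exactly one full row
assert spanned_rows(1, 4, 6) == 1   # fits inside the first row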

Quickest way to determine range overlap in Perl

I have two sets of ranges. Each range is a pair of integers (start and end) representing some sub-range of a single larger range. The two sets of ranges are in a structure similar to this (of course the ...s would be replaced with actual numbers).
$a_ranges =
{
a_1 =>
{
start => ...,
end => ...,
},
a_2 =>
{
start => ...,
end => ...,
},
a_3 =>
{
start => ...,
end => ...,
},
# and so on
};
$b_ranges =
{
b_1 =>
{
start => ...,
end => ...,
},
b_2 =>
{
start => ...,
end => ...,
},
b_3 =>
{
start => ...,
end => ...,
},
# and so on
};
I need to determine which ranges from set A overlap with which ranges from set B. Given two ranges, it's easy to determine whether they overlap. I've simply been using a double loop to do this--loop through all elements in set A in the outer loop, loop through all elements of set B in the inner loop, and keep track of which ones overlap.
I'm having two problems with this approach. First is that the overlap space is extremely sparse--even if there are thousands of ranges in each set, I expect each range from set A to overlap with maybe 1 or 2 ranges from set B. My approach enumerates every single possibility, which is overkill. This leads to my second problem--the fact that it scales very poorly. The code finishes very quickly (sub-minute) when there are hundreds of ranges in each set, but takes a very long time (+/- 30 minutes) when there are thousands of ranges in each set.
Is there a better way I can index these ranges so that I'm not doing so many unnecessary checks for overlap?
Update: The output I'm looking for is two hashes (one for each set of ranges) where the keys are range IDs and the values are the IDs of the ranges from the other set that overlap with the given range in this set.
This sounds like the perfect use case for an interval tree, which is a data structure specifically designed to support this operation. If you have two sets of intervals of sizes m and n, then you can build one of them into an interval tree in time O(m lg m) and then do n intersection queries in time O(n lg m + k), where k is the total number of intersections you find. This gives a net runtime of O((m + n) lg m + k). Remember that in the worst case k = O(nm), so this isn't any better than what you have, but for cases where the number of intersections is sparse this can be substantially better than the O(mn) runtime you have right now.
I don't have much experience working with interval trees (and zero experience in Perl, sorry!), but from the description it seems like they shouldn't be that hard to build. I'd be pretty surprised if one didn't exist already.
Hope this helps!
In case you are not exclusively tied to Perl: the IRanges package in R deals with interval arithmetic. It has very powerful primitives, and it would probably be easy to code a solution with them.
A second remark is that the problem could become very easy if the intervals have additional structure; for example, if within each set of ranges there is no overlap (in that case a linear approach that sifts through the two ordered sets simultaneously is possible). Even in the absence of such structure, the least you can do is to sort one set of ranges by start point and the other set by end point, then break out of the inner loop once a match is no longer possible; a sketch of a closely related sweep appears below. Of course, existing and general algorithms and data structures, such as the interval tree mentioned earlier, are the most powerful.
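Here is a minimal sketch of such a sweep in Python rather than Perl: both sets are sorted by start point, and ranges that can no longer match are dropped as the loop advances. The function name, dict layout, and inclusive endpoints are my assumptions, not from the original question:

def find_overlaps(a_set, b_set):
    """Map each id in a_set to the ids in b_set whose range overlaps it.

    a_set and b_set are dicts of id -> (start, end) with inclusive ends.
    """
    a_sorted = sorted(a_set.items(), key=lambda kv: kv[1][0])
    b_sorted = sorted(b_set.items(), key=lambda kv: kv[1][0])
    result = {a_id: [] for a_id in a_set}
    active = []  # b ranges that may still overlap the current or a later a range
    j = 0
    for a_id, (a_start, a_end) in a_sorted:
        # Pull in every b range that starts no later than this a range ends.
        while j < len(b_sorted) and b_sorted[j][1][0] <= a_end:
            active.append(b_sorted[j])
            j += 1
        # Drop b ranges that ended before this a range starts; since a starts
        # are non-decreasing, they cannot overlap any later a range either.
        active = [item for item in active if item[1][1] >= a_start]
        result[a_id] = [b_id for b_id, _ in active]
    return result

a = {"a_1": (1, 11), "a_2": (13, 44)}
b = {"b_1": (0, 1), "b_2": (2, 29)}
print(find_overlaps(a, b))  # {'a_1': ['b_1', 'b_2'], 'a_2': ['b_2']}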
There are several existing CPAN modules that solve this issue; I have developed two of them: Data::Range::Compare and Data::Range::Compare::Stream.
Data::Range::Compare only works with arrays in memory, but supports generic range types.
Data::Range::Compare::Stream works with streams of data via iterators, but it requires understanding OO Perl to extend to generic data types. Data::Range::Compare::Stream is recommended if you are processing very, very large sets of data.
Here is an excerpt from the Examples folder of Data::Range::Compare::Stream.
Given these 3 sets of data:
Numeric Range set: A contained in file: source_a.src
+----------+
| 1 - 11 |
| 13 - 44 |
| 17 - 23 |
| 55 - 66 |
+----------+
Numeric Range set: B contained in file: source_b.src
+----------+
| 0 - 1 |
| 2 - 29 |
| 88 - 133 |
+----------+
Numeric Range set: C contained in file: source_c.src
+-----------+
| 17 - 29 |
| 220 - 240 |
| 241 - 250 |
+-----------+
The expected output would be:
+--------------------------------------------------------------------+
| Common Range | Numeric Range A | Numeric Range B | Numeric Range C |
+--------------------------------------------------------------------+
| 0 - 0 | No Data | 0 - 1 | No Data |
| 1 - 1 | 1 - 11 | 0 - 1 | No Data |
| 2 - 11 | 1 - 11 | 2 - 29 | No Data |
| 12 - 12 | No Data | 2 - 29 | No Data |
| 13 - 16 | 13 - 44 | 2 - 29 | No Data |
| 17 - 29 | 13 - 44 | 2 - 29 | 17 - 29 |
| 30 - 44 | 13 - 44 | No Data | No Data |
| 55 - 66 | 55 - 66 | No Data | No Data |
| 88 - 133 | No Data | 88 - 133 | No Data |
| 220 - 240 | No Data | No Data | 220 - 240 |
| 241 - 250 | No Data | No Data | 241 - 250 |
+--------------------------------------------------------------------+
The source code can be found here.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use lib qw(./ ../lib);
# custom package from FILE_EXAMPLE.pl
use Data::Range::Compare::Stream::Iterator::File;
use Data::Range::Compare::Stream;
use Data::Range::Compare::Stream::Iterator::Consolidate;
use Data::Range::Compare::Stream::Iterator::Compare::Asc;
my $source_a=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_a.src');
my $source_b=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_b.src');
my $source_c=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_c.src');
my $consolidator_a=new Data::Range::Compare::Stream::Iterator::Consolidate($source_a);
my $consolidator_b=new Data::Range::Compare::Stream::Iterator::Consolidate($source_b);
my $consolidator_c=new Data::Range::Compare::Stream::Iterator::Consolidate($source_c);
my $compare=new Data::Range::Compare::Stream::Iterator::Compare::Asc();
my $src_id_a=$compare->add_consolidator($consolidator_a);
my $src_id_b=$compare->add_consolidator($consolidator_b);
my $src_id_c=$compare->add_consolidator($consolidator_c);
print " +--------------------------------------------------------------------+
| Common Range | Numeric Range A | Numeric Range B | Numeric Range C |
+--------------------------------------------------------------------+\n";
my $format=' | %-12s | %-13s | %-13s | %-13s |'."\n";
while($compare->has_next) {
my $result=$compare->get_next;
my $string=$result->to_string;
my @data=($result->get_common);
next if $result->is_empty;
for(0 .. 2) {
my $column=$result->get_column_by_id($_);
unless(defined($column)) {
$column="No Data";
} else {
$column=$column->get_common->to_string;
}
push @data,$column;
}
printf $format,@data;
}
print " +--------------------------------------------------------------------+\n";
Try Tree::RB, but for mutually exclusive ranges with no overlaps.
The performance was not bad: I had about 10000 segments and had to find the segment for each discrete number. My input had 300 million records that I needed to put into separate buckets, like partitioning the data. Tree::RB worked out great.
$var = [
[0,90],
[91,2930],
[2950,8293]
.
.
.
]
My lookup values were 10, 99, 991, ...
and basically I needed the position of the range for a given number.
Mine uses a comparison function something like the one below:
my $cmp = sub
{
    my ($a1, $b1) = @_;
    # both arguments are ranges: compare end of one against start of the other
    if(ref($b1) && ref($a1))
    {
        return ($$a1[1]) <=> ($$b1[0]);
    }
    my $ret = 0;
    if(ref($a1) eq 'ARRAY')
    {
        # $a1 is a range, $b1 is a lookup value
        if($$a1[0] <= $b1 && $b1 <= $$a1[1])
        {
            $ret = 0;       # value falls inside the range
        }
        elsif($$a1[1] < $b1)
        {
            $ret = -1;      # range lies entirely below the value
        }
        else
        {
            $ret = 1;       # range lies entirely above the value
        }
    }
    else
    {
        # $b1 is a range, $a1 is a lookup value
        if($$b1[0] <= $a1 && $a1 <= $$b1[1])
        {
            $ret = 0;
        }
        elsif($$b1[0] > $a1)
        {
            $ret = -1;
        }
        else
        {
            $ret = 1;
        }
    }
    return $ret;
};
I should check timings in order to know if it's the fastest way, but given the structure of your data you could try this:
use strict;
my $fromA = 12;
my $toA = 15;
my $fromB = 7;
my $toB = 35;
my @common_range = get_common_range($fromA, $toA, $fromB, $toB);
my $common_range = $common_range[0]."-".$common_range[-1];
sub get_common_range {
    my @A = $_[0]..$_[1];
    my %B = map {$_ => 1} $_[2]..$_[3];
    my @common = ();
    foreach my $i (@A) {
        if (defined $B{$i}) {
            push (@common, $i);
        }
    }
    return sort {$a <=> $b} @common;
}
