How to avoid DataFrame.collect() in a PySpark job for better performance

I need to match events between two DataFrames (sources and sons). For each son I look for rows where KeySource == Key within a time interval before the son's time (only the nearest one if Unique_Son is set). I use a DataFrame.collect() to compare the rows of the first DataFrame to the second. The code below (reproduced from memory) normally does the job, but it is very slow even for small tables (a few hundred rows). I'm new to Spark. How can I improve performance? My platform runs PySpark 2.3.
Thanks!
More details:
Ev1
| Time | Key | Index |
| -------- | -------- | -------- |
| t1 | k1 | i1 |
| t2 | k2 | i2 |
| t3 | k3 | i3 |
| t4 | k1 | i4 |
Ev2
| Time | KeySource| Index |
| -------- | -------- | -------- |
| t1 | k1 | i1 |
| t2 | k3 | i2 |
| t3 | k1 | i3 |
| t4 | k5 | i4 |
I look for rows where KeySource == Key within a time interval before Ev2.Time (only the nearest one if Unique_Son is set).
DeltaT_max = 30000
Unique_Son = True  # or False

Ev2 = Ev2.withColumn("tmin", Ev2.Time - lit(DeltaT_max))
Ev1 = Ev1.withColumn("Son_Index", lit(None))
Ev1 = Ev1.sort(desc("Time"))
Ev2_without_father = dict()

for ev2 in Ev2.select("Time", "KeySource", "Index", "tmin").collect():
    tmp_Ev1 = Ev1.filter((Ev1.Key == ev2.KeySource)
                         & (Ev1.Time >= ev2.tmin)
                         & (Ev1.Time <= ev2.Time))
    if tmp_Ev1.count() > 0:
        # I'm not sure withColumn here is a good idea, but the result is correct...
        if Unique_Son:
            # tag only the nearest father (first row, since Ev1 is sorted by Time desc)
            Ev1 = Ev1.withColumn("Son_Index",
                                 when(Ev1.Index == tmp_Ev1.first().Index, lit(ev2.Index))
                                 .otherwise(Ev1.Son_Index))
        else:
            # tag every father found in the interval
            Ev1 = Ev1.withColumn("Son_Index",
                                 when(Ev1.Index.isin([r.Index for r in tmp_Ev1.collect()]),
                                      lit(ev2.Index))
                                 .otherwise(Ev1.Son_Index))
    else:
        # rare case; rebuilt into a DataFrame afterwards for CSV export
        Ev2_without_father[ev2.Index] = ev2.KeySource

Ev2 = Ev2.drop("tmin")
Result = Ev1.select("Index", "Son_Index")
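For reference, the collect() loop can usually be replaced by a range join plus a window function, which lets Spark plan the whole comparison at once instead of looping on the driver. A minimal sketch, assuming the schemas above (the aliases f and s and the names cand, w and rn are mine, not from the question):

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# join every son (Ev2) to its candidate fathers (Ev1) within the interval
cand = Ev1.alias("f").join(
    Ev2.alias("s"),
    (col("f.Key") == col("s.KeySource"))
    & (col("f.Time") >= col("s.Time") - DeltaT_max)
    & (col("f.Time") <= col("s.Time")),
    "inner",
)

if Unique_Son:
    # keep only the nearest (latest) father per son
    w = Window.partitionBy(col("s.Index")).orderBy(col("f.Time").desc())
    cand = cand.withColumn("rn", row_number().over(w)).filter(col("rn") == 1)

Result = cand.select(col("f.Index").alias("Index"), col("s.Index").alias("Son_Index"))

Sons with no father in the interval can then be recovered with a left_anti join of Ev2 against Result.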

Related

How to get multiple rows of data from a subquery into multiple columns - Oracle SQL

I have a sub-select in my code which returns the discount a product has. The product might in some instances have multiple discounts (a discount on a discount for a special promo). When I run the code, I get a "single-row subquery returns more than one row" error. I want it to return all 3 rows but in different columns, as discount 1, discount 2 and discount 3.
My code is as follows:
select prod.prod_id,
       prod.prod_name,
       st.store,
       reg.region,
       (select dis.discount
        from discounts dis
        where prod.prod_id = dis.prod_id
          and st.store_cd = dis.store_id
          and dis.reg_cd = reg.reg_cd
          and dis.eff_dt <= :dt
          and (dis.xpir_dt is null or dis.xpir_dt > :dt)
          and rownum = 1) as discount
from products prod,
     stores st,
     region reg
where prod.prod_id = st.prod_id
  and st.reg_cd = reg.reg_cd
So I want to get rid of the rownum = 1, since it forces only one discount to be returned, and instead return all three discounts in separate columns.
Edit: there are other subqueries connected to this (it is longer code and I only put a segment of it), so removing the subquery and putting it in the main join clause does not work well when joining to the other subqueries.
Edit 2: Sample data:
products table
| prod_id | prod_name|
| ------- | ---------|
| 1 | mangoes |
| 2 | apples |
discounts table
| prod_id | discount |
| ------- | ---------|
| 1 | 10% |
| 1 | 5% |
| 2 | 3% |
| 2 | 8% |
| 2 | 2% |
There are store and region tables which all have single-row entries, similar to the products table.
The ideal output should be
| prod_id | prod_name| store | region | Discount 1| Discount 2| Discount 3 |
| ------- | ---------| ----- | ------ | --------- | ----------| -----------|
| 1 | mangoes | Mega | GP | 10% | 5% | 0% |
| 2 | apples | Mini | GP | 3% | 8% | 2% |
Just join the table:
SELECT prod.prod_id,
prod.prod_name,
st.store,
reg.region,
dis.discount
FROM products prod,
stores st,
region reg,
discounts dis
WHERE prod.prod_id = st.prod_id
AND st.reg_cd = reg.reg_cd
AND dis.prod_id = prod.prod_id
AND dis.store_id = st.store_cd
AND dis.reg_cd = reg.reg_cd
AND eff_dt <= :dt
AND (xpir_dt is null OR xpir_dt > :dt)
Try the below using PIVOT:
SELECT *
FROM (SELECT pr.prod_id AS pr_id, pr.prod_name, d.discount AS disc
      FROM products pr
      JOIN discounts d ON pr.prod_id = d.prod_id)
PIVOT (COUNT(pr_id) AS tw
       FOR (disc) IN ('2%', '10%', '8%', '3%', '5%'))

Unexpected behaviour of rand() in MySQL

I encountered a very weird result while trying to filter my data using the RAND() function.
Suppose I have a table filled with some data:
CREATE TABLE `status_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `rank` int(11) DEFAULT 50,
  PRIMARY KEY (`id`)
)
Then I do the following simple select:
select id,rank as rank,(rand()*100) as thres
from status_log
where rank = 50
and have a clear and expected output:
<...skip...>
| 6575476 | 50 | 34.51090244065123 |
| 6575511 | 50 | 67.84258230388404 |
| 6575589 | 50 | 35.68020727083106 |
| 6575644 | 50 | 74.87329251586766 |
| 6575723 | 50 | 67.32584384020961 |
| 6575771 | 50 | 12.009344726809621 |
| 6575863 | 50 | 58.06919518678374 |
+---------+------+-----------------------+
66169 rows in set (2.502 sec).
So, I generate some random data from 0 to 100 and attach a value to each row, around 66000 results in total.
Then I want only a (random) part of the data to be shown. It doesn't have any purpose for production, by the way; it's just an artificial test, so let's not discuss it.
select *
from (
select id,rank as rank,(rand()*100) as thres
from status_log
where rank = 50) t
where thres>rank
order by thres;
After that i get the following:
<...skip...>
| 4396732 | 50 | 99.97966075314177 |
| 4001782 | 50 | 99.98002871869134 |
| 1788580 | 50 | 99.98064143581375 |
| 5300286 | 50 | 99.98275954274717 |
| 146401 | 50 | 99.98552389441573 |
| 4744748 | 50 | 99.98644758014609 |
+---------+------+--------------------+
16449 rows in set (2.188 sec)
It's obvious that for a mean of 50 the expected number of results should be around 33000 out of the total 66000. So it seems that the distribution of rand() is biased, correct?
Let's then change > to <:
select *
from (
select id,rank as rank,(rand()*100) as thres
from status_log
where rank = 50) t
where thres<rank
order by thres;
<...skip...>
| 4653786 | 50 | 49.98035016467827 |
| 6041489 | 50 | 49.980370281245904 |
| 5064204 | 50 | 49.989308742796354 |
| 1699741 | 50 | 49.991373205549436 |
| 3234039 | 50 | 49.99390454030959 |
| 806791 | 50 | 49.99575274996064 |
| 3713581 | 50 | 49.99814410693771 |
+---------+------+----------------------+
16562 rows in set (2.373 sec)
Again around 16000! So not half but a quarter of all results is shown!
It seems that the output of rand() inside the brackets is somehow influenced by the expression outside them. How is this possible?
I can also union it:
select * from (select id,rank as rank,(rand()*100) as thres from status_log where rank = 50) t where thres<50
UNION ALL
select * from (select id,rank as rank,(rand()*100) as thres from status_log where rank = 50) t where thres>=50;
The expected number of results has to be somewhere around 66000, but it returns only 33000 or so.
I observe this behavior only when rand() is non-deterministic and is generated dynamically each time. If I do ...select id,rank as rank,(rand(id)*100)... (i.e. make the output of rand() dependent on id), I start getting the expected number of results (around 33000). The same happens if I precalculate and fill a temporary field in the table.
I also tried filtering with rank=30, and the results were ~6000 and ~32000 for < and > respectively.
Version 10.5.8-MariaDB-3, InnoDB
Using a single query with HAVING instead of a subquery with WHERE in the main query seems to work around it.
select id,rank as rank,(rand()*100) as thres
from status_log
where rank = 50
having thres > rank
order by thres
This appears to be this bug:
RAND() evaluated and filtered twice with subquery

Concatenate two or more rows from a result into a single result with CodeIgniter ActiveRecord

I have a situation like this: I want to get values from the database (stored comma-delimited) from more than one row, based on the month and year that I choose. For more detail, check this out:
My Schedule.sql:
+---+------------+-------------------------------------+
|ID |Activ_date | Do_skill |
+---+------------+-------------------------------------+
| 1 | 2020-10-01 | Accountant,Medical,Photograph |
| 2 | 2020-11-01 | Medical,Photograph,Doctor,Freelancer|
| 3 | 2020-12-01 | EO,Teach,Scientist |
| 4 | 2021-01-01 | Engineering, Freelancer |
+---+------------+-------------------------------------+
My skillqmount.sql:
+----+------------+------------+-------+
|ID |Date_skill |Skill |Price |
+----+------------+------------+-------+
| 1 | 2020-10-02 | Accountant | $ 5 |
| 2 | 2020-10-03 | Medical | $ 7 |
| 3 | 2020-10-11 | Photograph | $ 5 |
| 4 | 2020-10-12 | Doctor | $ 9 |
| 5 | 2020-10-01 | Freelancer | $ 7 |
| 6 | 2020-10-04 | EO | $ 4 |
| 7 | 2020-10-05 | Teach | $ 4 |
| 8 | 2020-11-02 | Accountant | $ 5 |
| 9 | 2020-11-03 | Medical | $ 7 |
| 10 | 2020-11-11 | Photograph | $ 5 |
| 11 | 2020-11-12 | Doctor | $ 9 |
| 12 | 2020-11-01 | Freelancer | $ 7 |
+----+------------+------------+-------+
On my website I want to make a calculation with those two tables. So if I want to see, from 2020-10-01 until 2020-11-01, the total amount between those dates, it should look like this:
Output example
+----+-----------+-----------+---------+
|No |Date Start |Date End |T.Amount |
+----+-------- --+-----------+---------+
|1 |2020-10-01 |2020-11-01 |$ 45 | <= this amount came from $5+$7+$5+$7+$5+$9+$7
+----+-------- --+-----------+---------+
Note:
Date Start: Input->post("A")
Date End: Input->post("B")
T.Amount: total amount based on inputs A and B (by date)
I tried this code to get it:
<?php
$startd = $this->input->post('A');
$endd = $this->input->post('B');

$chck = $this->db->select('Do_skill')
    ->where('Activ_date >=', $startd)
    ->where('Activ_date <', $endd)
    ->get('Schedule')
    ->row('Do_skill');

$dcek = $this->Check_model->comma_separated_to_array($chck);

$t_amount = $this->db->select_sum('price')
    ->where('Date_skill >=', $startd)
    ->where('Date_skill <', $endd)
    ->where_in('Skill', $dcek)
    ->get('skillqmount')
    ->row('price');

echo $t_amount;
?>
Check_model:
public function comma_separated_to_array($chck, $separator = ',')
{
    // Explode on the separator
    $vals = explode($separator, $chck);
    $val = array();
    // Trim whitespace around each value
    foreach ($vals as $v) {
        $val[] = trim($v);
    }
    return $val;
}
My problem is that the result in $t_amount is not $45. I think there's something wrong with my code above. If there is any advice, I'd appreciate it very much. Thank you...
Your first query only returns one row of data.
I think you can do something like this for the first query:
$query1 = $this->db->query("SELECT Do_skill FROM schedule WHERE activ_date >= '$startd' AND activ_date < '$endd'");
$check = $query1->result_array();
$array = [];
foreach ($check as $ck) {
    // split the comma-delimited skills and collect them all
    $dats = explode(',', $ck['Do_skill']);
    $counter = count($dats);
    for ($i = 0; $i < $counter; $i++) {
        array_push($array, $dats[$i]);
    }
}
and you can use the array to do your next query :)
The array $dcek has the values
Accountant,Medical,Photograph
The query from Codeigniter is
SELECT SUM(`price`) AS `price` FROM `skillqmount`
WHERE `Date_skill` >= '2020-10-01' AND
`Date_skill` < '2020-11-01' AND
`Skill` IN('Accountant', 'Medical', 'Photograph')
which returns 17 - this matches the first three entries in your data.
Your first query will only ever give one row, even if the date range would match multiple rows.

Routing table optimization

The problem I am trying to solve is: I have a routing table
|src|dest|port|
| a | b | p1 |
| a | b | p2 |
| a | b | p3 |
| a | c | p1 |
| a | d | p2 |
| a | e | p3 |
This can be optimized to
|src|dest|port|
| a | b |p1,p2,p3|
| a | c | p1 |
| a | d | p2 |
| a | e | p3 |
Which can further be optimized to
|src|dest|port|
| a |b,c | p1 |
| a |b,d | p2 |
| a |b,e | p3 |
I thought of using a 3-dimensional representation to solve this problem, but then retrieval would be complicated.
I need the best data structure for this use case.
The data structure is a set of sets of dest values, where the first set is keyed by src values and the second set is keyed by port values. This groups dest values by src and port:
src => port => [dest]
In Python, this can be done with dictionaries:
table = (
    ('a', 'b', 'p1'),
    ('a', 'b', 'p2'),
    ('a', 'b', 'p3'),
    ('a', 'c', 'p1'),
    ('a', 'd', 'p2'),
    ('a', 'e', 'p3'),
)

optimized = {}
for route in table:
    src, dest, port = route
    o = optimized.get(src, {})
    p = o.get(port, [])
    p.append(dest)
    o[port] = p
    optimized[src] = o

for src, route in optimized.items():
    for port, dest in route.items():
        print(src, dest, port)
The result (in unsorted order) is:
a ['b', 'd'] p2
a ['b', 'e'] p3
a ['b', 'c'] p1
The lookup can be done using:
dest = optimized[src][port]
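For what it's worth, the same grouping can be written more compactly with collections.defaultdict (reusing the table tuple from above):

from collections import defaultdict

# src -> port -> list of dests
optimized = defaultdict(lambda: defaultdict(list))
for src, dest, port in table:
    optimized[src][port].append(dest)

# e.g. optimized['a']['p2'] == ['b', 'd']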

MapReduce matrix multiplication complexity

Assume that we have a large file which contains descriptions of the cells of two matrices (A and B):
+---------------------------------+
| i | j | value | matrix |
+---------------------------------+
| 1 | 1 | 10 | A |
| 1 | 2 | 20 | A |
| | | | |
| ... | ... | ... | ... |
| | | | |
| 1 | 1 | 5 | B |
| 1 | 2 | 7 | B |
| | | | |
| ... | ... | ... | ... |
| | | | |
+---------------------------------+
And we want to calculate the product of these matrices: C = A x B
By definition: C_i_j = sum( A_i_k * B_k_j )
And here is a two-step MapReduce algorithm for calculating this product (in pseudocode):
First step:
function Map (input is a single row of the file from above):
    i = row[0]
    j = row[1]
    value = row[2]
    matrix = row[3]
    if (matrix == 'A')
        emit(j, {i, value, 'A'})   # key A's cells by their column index k = j
    else
        emit(i, {j, value, 'B'})   # key B's cells by their row index k = i
Complexity of this Map function is O(1)
function Reduce (Key, List of tuples from the Map function):
    Matrix_A_tuples =
        filter(List of tuples from the Map function, where matrix == 'A')
    Matrix_B_tuples =
        filter(List of tuples from the Map function, where matrix == 'B')
    for each tuple_A from Matrix_A_tuples
        i = tuple_A[0]
        value_A = tuple_A[1]
        for each tuple_B from Matrix_B_tuples
            j = tuple_B[0]
            value_B = tuple_B[1]
            emit({i, j}, {value_A * value_B, 'C'})
Complexity of this Reduce function is O(N^2)
After the first step we will get something like the following file (which contains O(N^3) lines):
+---------------------------------+
| i | j | value | matrix |
+---------------------------------+
| 1 | 1 | 50 | C |
| 1 | 1 | 45 | C |
| | | | |
| ... | ... | ... | ... |
| | | | |
| 2 | 2 | 70 | C |
| 2 | 2 | 17 | C |
| | | | |
| ... | ... | ... | ... |
| | | | |
+---------------------------------+
So all we have to do is sum the values from the lines which contain the same i and j.
Second step:
function Map (input is a single row of the file produced in the first step):
    i = row[0]
    j = row[1]
    value = row[2]
    emit({i, j}, value)

function Reduce (Key, List of values from the Map function):
    i = Key[0]
    j = Key[1]
    result = 0
    for each Value from List of values from the Map function
        result += Value
    emit({i, j}, result)
After the second step we will get the file which contains the cells of matrix C.
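For illustration, here is a minimal single-process Python sketch of the same two-step algorithm; the dictionaries stand in for the shuffle between map and reduce, and all names are mine rather than from any MapReduce framework:

from collections import defaultdict

# toy input: (i, j, value, matrix) rows for 2x2 matrices A and B
rows = [
    (1, 1, 10, 'A'), (1, 2, 20, 'A'), (2, 1, 30, 'A'), (2, 2, 40, 'A'),
    (1, 1, 5, 'B'), (1, 2, 7, 'B'), (2, 1, 6, 'B'), (2, 2, 8, 'B'),
]

# step 1 map + shuffle: key A's cells by column k, B's cells by row k
by_k = defaultdict(lambda: {'A': [], 'B': []})
for i, j, value, matrix in rows:
    if matrix == 'A':
        by_k[j]['A'].append((i, value))
    else:
        by_k[i]['B'].append((j, value))

# step 1 reduce: one partial product per (A cell, B cell) pair
partials = []
for k, groups in by_k.items():
    for i, value_A in groups['A']:
        for j, value_B in groups['B']:
            partials.append(((i, j), value_A * value_B))

# step 2: sum the partial products per (i, j) cell of C
C = defaultdict(int)
for (i, j), value in partials:
    C[(i, j)] += value

print(sorted(C.items()))
# [((1, 1), 170), ((1, 2), 230), ((2, 1), 390), ((2, 2), 530)]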
So the question is:
Taking into account that there are multiple instances in a MapReduce cluster, what is the most correct way to estimate the complexity of the provided algorithm?
The first one that comes to mind is this: assume that the number of instances in the MapReduce cluster is K. Then, because the file produced after the first step contains O(N^3) lines, the overall complexity can be estimated as O((N^3)/K).
But this estimation doesn't take into account many details, such as network bandwidth between instances of the MapReduce cluster, the ability to distribute data between instances and perform most of the calculations locally, etc.
So, I would like to know which is the best approach for estimating the efficiency of the provided MapReduce algorithm, and does it make sense to use Big-O notation to estimate the efficiency of MapReduce algorithms at all?
As you said, Big-O estimates computational complexity and does not take into consideration networking issues such as bandwidth, congestion and delay.
If you want to calculate how efficient the communication between instances is, you need other networking metrics.
However, I want to tell you something: if your file is not big enough, you will not see an improvement in terms of execution speed. This is because MapReduce works efficiently only with big data. Moreover, your algorithm has two steps, which means two jobs; between one job and the next, MapReduce takes time to write out the file and start the next job, and this can slightly affect performance.
I think you can measure efficiency in terms of speed and time, as the MapReduce approach is certainly faster for big data when compared to sequential algorithms.
Moreover, efficiency can also be considered with regard to fault tolerance: MapReduce handles failures by itself, so programmers do not need to handle instance or networking failures.
