I am new to PySpark and have a performance issue. Given a dataframe:
| Group | Id | Selected |
| ----- | ---- | --------- |
| A | id1 | 0 |
| A | id2 | 0 |
| A | id3 | 0 |
| B | id4 | 0 |
| B | id5 | 0 |
And a sampling dictionary
sample_dict = {'A': 2, 'B': 1}
I want to randomly update the Selected column of the dataframe so that each group gets the number of selected rows given by the sampling dictionary. For example:
| Group | Id | Selected |
| ----- | ---- | --------- |
| A | id1 | *1* |
| A | id2 | 0 |
| A | id3 | *1* |
| B | id4 | 0 |
| B | id5 | *1* |
Currently, I can only iterate through each group, get the subgroup, randomly select Ids, and update the original dataframe according to those Ids:
from pyspark.sql.functions import col, rand, when

for group, n in sample_dict.items():
    sub_df = df.filter(col('Group') == group)
    # randomly sample n Ids from the subgroup
    id_list = [r['Id'] for r in sub_df.orderBy(rand()).limit(n).collect()]
    # flag the sampled Ids in the original dataframe
    df = df.withColumn('Selected', when(col('Id').isin(id_list), 1).otherwise(col('Selected')))
The problem with this approach is sequential processing (each group is handled one by one). As the number of groups and the total row count scale up, this PySpark code runs slower than a simple pandas version (on Databricks).
Would you share better approaches (e.g. processing the selection for each group in parallel) for this problem in PySpark?
Thank you very much
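One way to avoid the driver-side loop (a sketch, not from the original post; column names follow the example above): rank the rows randomly within each group with a window function, then flag the top n rows per group in a single pass, so Spark handles all groups in parallel.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Literal lookup column built from sample_dict: Group -> rows to select
sample_map = F.create_map([F.lit(x) for kv in sample_dict.items() for x in kv])

# Random rank within each group
w = Window.partitionBy('Group').orderBy(F.rand())
df = (df.withColumn('rn', F.row_number().over(w))
        .withColumn('Selected',
                    F.when(F.col('rn') <= sample_map[F.col('Group')], 1)
                     .otherwise(F.col('Selected')))
        .drop('rn'))

Because this is a single window aggregation, the per-group ranking is distributed across the cluster instead of running group by group in the driver.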
First, I've checked other topics on the subject, like "How to transpose/pivot data in Hive?", but that doesn't match what I want.
So this is the table I have:
| ID | Day | Status |
| -- | --- | ------ |
| 1 | 1 | A |
| 2 | 10 | B |
| 3 | 101 | A |
| 3 | 322 | B |
| 3 | 102 | C |
| 3 | 354 | D |
And I'd like to concatenate the different Status values for each ID, ordered by Day, in order to have this:
| ID | Status |
| -- | ------ |
| 1 | A |
| 2 | B |
| 3 | A,C,B,D |
The thing is that I don't know how many Status values there can be, so I can't create a fixed number of columns for the days. Since I don't know how many day/status pairs I'll have, I don't know how to adapt the answers from other topics (group_map and the like) to my problem.
Thanks for helping me ^^
Use collect_set (for distinct values) or collect_list to aggregate the values into an array, and concatenate it using concat_ws:
select ID, concat_ws(',',collect_list(Status)) as Status
from table
group by ID;
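One caveat: collect_list alone does not guarantee that the array follows the Day order the question asks for. A hedged order-preserving variant (sketched with the PySpark DataFrame API rather than plain Hive, since it relies on Spark SQL's transform higher-order function; column names follow the question): collect (Day, Status) structs, sort them (structs sort by their first field, here Day), then keep only Status before concatenating.

from pyspark.sql import functions as F

# Collect (Day, Status) pairs, sort by Day, drop Day, then concatenate
result = df.groupBy('ID').agg(F.expr(
    "concat_ws(',', transform(sort_array(collect_list(struct(Day, Status))), "
    "s -> s.Status)) as Status"
))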
Very similar to my last question: now I want only the "full combination" for a group, in order of priority. So, from this source table:
+-------+-------+----------+
| GROUP | State | Priority |
+-------+-------+----------+
| 1 | MI | 1 |
| 1 | IA | 2 |
| 1 | CA | 3 |
| 1 | ND | 4 |
| 1 | AZ | 5 |
| 2 | IA | 2 |
| 2 | NJ | 1 |
| 2 | NH | 3 |
And so on...
I need a query that returns:
+-------+---------------------+
| GROUP | COMBINATION |
+-------+---------------------+
| 1 | MI, IA, CA, ND, AZ |
| 2 | NJ, IA, NH |
+-------+---------------------+
Thanks for the help, again!
Use listagg(), ordering by priority within the group. (The identifiers are double-quoted because GROUP is a reserved word.)
SELECT "GROUP",
listagg("STATE", ', ') WITHIN GROUP (ORDER BY "PRIORITY")
FROM "ELBAT"
GROUP BY "GROUP";
I have 3 tables:
Product:
+------------+--------------+--------+
| ID_product | name_product | Amount |
+------------+--------------+--------+
| 0          | Door         | 450    |
| 1          | Fence        | 1500   |
+------------+--------------+--------+
Operation:
+--------------+----------------+------+
| ID_operation | name_operation | cost |
+--------------+----------------+------+
| 0            | Repair         | 250  |
| 1            | Build          | 320  |
+--------------+----------------+------+
Process:
+------------+--------------+
| ID_product | ID_operation |
+------------+--------------+
| 0          | 0            |
| 0          | 1            |
| 1          | 0            |
| 1          | 1            |
+------------+--------------+
And I need to calculate the sum of costs for each product, like this:
Result table:
+--------------+---------------+
| name_product | TOTAL_COSTS   |
+--------------+---------------+
| Door         | 570 (250+320) |
| Fence        | 570           |
+--------------+---------------+
But I don't have any idea how. I think I need some JOINs like the ones below, but I don't know how to handle the sum.
SELECT name_product, operation.cost
FROM product
JOIN process ON product.ID_product = process.ID_product
JOIN operation ON operation.ID_operation = process.ID_operation
ORDER BY product.ID_product;
Try the query below:
SELECT P.NAME_PRODUCT, SUM(O.COST) AS COST
FROM PROCESS PR, PRODUCT P, OPERATION O
WHERE PR.ID_PRODUCT = P.ID_PRODUCT
  AND PR.ID_OPERATION = O.ID_OPERATION
GROUP BY P.NAME_PRODUCT;
You are almost there. Your JOINs are OK; you just need to add a GROUP BY clause with the aggregate function SUM.
SELECT product.name_product, SUM(operation.cost) total_costs
FROM product
JOIN process ON product.ID_product = process.ID_product
JOIN operation ON operation.ID_operation = process.ID_operation
GROUP BY product.ID_product, product.name_product
ORDER BY product.ID_product;
When I was studying string-matching algorithms, I came across lecture notes saying that, for example, for the pattern abaab the Morris-Pratt table is:
| a | b | a | a | b |
| - | - | - | - | - |
| 0 | 0 | 1 | 1 | 2 |
I understand how this is generated, but the KMP table given is:
| a | b | a | a | b |
| - | - | - | - | - |
| 0 | -1 | 1 | 0 | 2 |
Can anyone help me understand why the second table is like that?
Thanks!
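For reference, a small sketch (using one common convention from lecture notes, not necessarily yours) that reproduces both tables. The idea behind the second table: if the character just after a border equals the next pattern character, re-comparing it after a shift would fail the same way, so KMP replaces that entry with a shorter fallback; -1 means "no fallback, shift past this position".

# Morris-Pratt table: pi[i] = length of the longest proper border of p[0..i]
def mp_table(p):
    pi = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[k] != p[i]:
            k = pi[k - 1]
        if p[k] == p[i]:
            k += 1
        pi[i] = k
    return pi

# KMP table: replace a border whose next character equals p[i + 1],
# because comparing that character again after a shift must fail too
def kmp_table(p):
    pi = mp_table(p)
    kmp = pi[:]
    for i in range(len(p) - 1):
        if p[pi[i]] == p[i + 1]:
            kmp[i] = kmp[pi[i] - 1] if pi[i] > 0 else -1
    return kmp

print(mp_table("abaab"))   # [0, 0, 1, 1, 2]
print(kmp_table("abaab"))  # [0, -1, 1, 0, 2]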
Assume a range of values inserted in a schema table. At the end of the month I want to apply the following algorithm to these records (e.g. 2,500 rows of numeric values): sort the values in ascending order (from the smallest to the highest) and then find the value at the 80% mark of the sorted column.
In my example, if each row increases by one starting from 1, the 80% value will be row/value 2000 (= 2500 - 2500*20/100). This algorithm needs to be implemented in a procedure where the number of rows is not constant; for example, it can vary from 2,500 to 1,000,000 per month.
Hint: You can achieve this using Oracle's cumulative aggregate functions. For example, suppose your table looks like this:
MY_TABLE
+-----+----------+
| ID | QUANTITY |
+-----+----------+
| A | 1 |
| B | 2 |
| C | 3 |
| D | 4 |
| E | 5 |
| F | 6 |
| G | 7 |
| H | 8 |
| I | 9 |
| J | 10 |
+-----+----------+
At each row, you can sum the quantities so far using this:
SELECT
id,
quantity,
SUM(quantity)
OVER (ORDER BY quantity ROWS UNBOUNDED PRECEDING)
AS cumulative_quantity_so_far
FROM
MY_TABLE
Giving you:
+-----+----------+----------------------------+
| ID | QUANTITY | CUMULATIVE_QUANTITY_SO_FAR |
+-----+----------+----------------------------+
| A | 1 | 1 |
| B | 2 | 3 |
| C | 3 | 6 |
| D | 4 | 10 |
| E | 5 | 15 |
| F | 6 | 21 |
| G | 7 | 28 |
| H | 8 | 36 |
| I | 9 | 45 |
| J | 10 | 55 |
+-----+----------+----------------------------+
At each row you then know the running total, so you can compare it against 80% of the overall sum to locate the cutoff row. Hopefully this will help in your work.
Write a query using the percentile_disc function to solve your problem. Sounds like it does what you want.
An example would be:
select percentile_disc(0.8) within group (order by the_value)
from my_table
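As a sanity check against the example in the question (plain Python modelling percentile_disc's discrete semantics, which return the smallest value whose cumulative distribution reaches the given fraction):

# For values 1..2500, the smallest value with cumulative distribution >= 0.8
values = sorted(range(1, 2501))
idx = next(i for i, _ in enumerate(values) if (i + 1) / len(values) >= 0.8)
print(values[idx])  # 2000, matching the 80% row in the question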