split quantities algorithm (stock exchange orders) - algorithm

I have a problem where I have multiple (a few thousand) quantities that I need to split between a set number of recipients, such that each quantity is split into whole numbers and using the same proportions.
I need to find an algorithm that implements this reliably and efficiently (don't we all? :-) )
This is to solve a problem in financial markets (stock exchange orders) where an order might get thousands of "fills" and at the end of the day must be distributed to a few clients while maintaining the order's average price. Here's an example:
Total Order Quantity 37300
Quantities filled by the Stock Exchange
Execution 1. 16700 shares filled at price 75.84
Execution 2. 5400 shares filled at price 75.85
Execution 3. 4900 shares filled at price 75.86
Execution 4. 10300 shares filled at price 75.87
Total 37300 shares filled at average price = (16700*75.84 + 5400*75.85 + 4900*75.86 + 10300*75.87) / 37300 = 75.85235925
Suppose I need to split these quantities between 3 clients such that :
Client1: 15000 shares
Client2: 10000 shares
Client3: 12300 shares
Each execution must be split individually (I can't just take each client's requested quantity priced at the average price).
My first thought was to split each proportionately :
Client 1 gets 15000/37300=0.402144772
Client 2 gets 10000/37300=0.268096515
Client 3 gets 12300/37300=0.329758713
Which would lead to
Splits per client (requested quantity, ratio, and each execution's share):
+--------------+--------------+--------------+
|   Client 1   |   Client 2   |   Client 3   |
+--------------+--------------+--------------+
| 15000        | 10000        | 12300        |
| 0.402144772  | 0.268096515  | 0.329758713  |
+--------------+--------------+--------------+
| 6715.817694  | 4477.211796  | 5506.970509  |
| 2171.581769  | 1447.72118   | 1780.697051  |
| 1970.509383  | 1313.672922  | 1615.817694  |
| 4142.091153  | 2761.394102  | 3396.514745  |
+--------------+--------------+--------------+
| Totals:      |              |              |
| 15000        | 10000        | 12300        |
+--------------+--------------+--------------+
The problem with this is that I can't assign fractional quantities to clients, so I need a smart algorithm that adjusts the quantities such that the fractional parts of these splits are 0. I understand that this may be impossible in many scenarios, so the requirement can be relaxed a little so that a certain client gets slightly more (or less).
Does anybody know of an algorithm that I can use as a starting point for this problem?

You can round all the numbers (ratio[n] * totalQuantity) except the last one (possibly the smallest). The last one must be totalQuantity minus the sum of the others. This gives you whole-number quantities with a correct total, while staying as close as possible to the ratios you chose.
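A minimal sketch of that idea in Python, applied to one execution at a time (the function name and structure are mine, not from the original answer):

def split_execution(exec_qty, client_qtys, total_qty):
    # Round every client's proportional share except the last;
    # the last client absorbs whatever remains so the pieces sum exactly.
    shares = [round(exec_qty * q / total_qty) for q in client_qtys[:-1]]
    shares.append(exec_qty - sum(shares))
    return shares

executions = [16700, 5400, 4900, 10300]
clients = [15000, 10000, 12300]
for e in executions:
    print(e, split_execution(e, clients, 37300))

Note that because each execution is rounded independently, a client's total across all executions can drift a share or two away from the exact requested quantity; if that matters, the same remainder trick can be applied again per client across executions.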

Try to look at this from a different angle. You already know how many shares each client is getting. You want to calculate the fair total amount each has to pay, and do this without rounding errors.
You therefore want these total dollar amounts to have no rounding issues, i.e. to be accurate to 0.01.
Prices can then be computed using the dollar amounts and displayed to the required precision.
The opposite (calculate prices, then derive amounts) will always yield rounding issues with the dollar amounts.
Assuming price is per 100 units, here's one way to accomplish this:
Calculate total $ for the order (16,700*75.84/100 + 5,400*75.85/100 + 4,900*75.86/100 + 10,300*75.87/100) = $28,292.93
Allocate all clients except one, based on the ratio of the client's quantity to the total quantity filled:
Client 2 = $28,292.93 / 37,300 * 10,000 = $7,585.24
Price = $7,585.24 / 10,000 * 100 = 75.8524
Client 3 = $28,292.93 / 37,300 * 12,300 = $9,329.84
Price = $9,329.84 / 12,300 * 100 = 75.85235772
Calculate the last client as the remaining $$$:
$28,292.93 - ($7,585.24 + $9,329.84) = $11,377.85.
Price = $11,377.85 / 15,000 * 100 = 75.85233333
Here I arbitrarily picked Client 1, the one with the largest quantity, to be the subject of the remainder calculation.
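Here is a rough sketch of the same dollars-first allocation in Python, using Decimal to keep the amounts exact to the cent (the function and variable names are mine):

from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def allocate_dollars(executions, client_qtys):
    # executions: list of (quantity, price per 100 units); client_qtys: shares per client
    total_qty = sum(q for q, _ in executions)
    total_dollars = sum(Decimal(q) * Decimal(str(p)) / 100 for q, p in executions)
    total_dollars = total_dollars.quantize(CENT, ROUND_HALF_UP)
    amounts = []
    for qty in client_qtys[:-1]:                      # all clients except the last
        amounts.append((total_dollars * qty / total_qty).quantize(CENT, ROUND_HALF_UP))
    amounts.append(total_dollars - sum(amounts))      # last client takes the remainder
    prices = [a / q * 100 for a, q in zip(amounts, client_qtys)]
    return amounts, prices

executions = [(16700, 75.84), (5400, 75.85), (4900, 75.86), (10300, 75.87)]
clients = [10000, 12300, 15000]   # largest client last, so it absorbs the remainder
print(allocate_dollars(executions, clients))
# amounts: 7585.24, 9329.84, 11377.85 -- matching the worked example above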

Related

Elasticsearch - Sum by quantity and sort by lowest price

I have a requirement in Elasticsearch which I'm not able to implement at the moment. The use case is as follows; we have certain products uploaded in elastic (1 million + items) and each item has a quantity, a price and a lead time (for delivery).
Now I basically want to get the top matches (based on a product description search) where the total sum of all quantities = 1000 (for example), sorted by the lowest price.
A similar but different query would be to get the top 1000 items with the lowest lead time.
Any recommendation on how to implement this and what the most performant way of doing this is?
Assume we have the following records:
Product 1 | Quantity 200 | price 4USD | lead time 2 days
Product 2 | Quantity 150 | price 3USD | lead time 5 days
Product 3 | Quantity 275 | price 5 USD | lead time 14 days
Now I want to get all products for a maximum total quantity of 200, with the cheapest items first. That would give me something like:
Product 2
Product 1
And then it would also give me some aggregates, like: the average delivery time for these 2 items is 3.5 days and the total value is 650 USD (150 x 3 USD + 50 x 4 USD).
Thanks,
Bram

How to decide the probability percentage in question

I have the below question:
In the first part of the question, it says the probability that the selected person will be a male is 0.44, which means the number of males is 25*0.44 = 11. That's OK.
In the second part, the probability that the selected person will be a male who was born before 1960 is 0.28. Does that mean 0.28 out of the total number, which is 25, or out of the number of males?
I mean, should the number of males who were born before 1960 equal 25*0.28 or 11*0.28?
I find it easiest to think of these sorts of problems as contingency tables.
You use a matrix layout to express the distributions in terms of two or more factors or characteristics, each having two or more categories. The table can be constructed either with probabilities (proportions) or with counts, and switching back and forth is easy based on the total count in the table. Entries in the table are the intersections of the categories, corresponding to "and" in a verbal description. The numbers to the right or at the bottom of the table are called marginals, because they're found in the margins of the tables, and are always the sum of the table row or column entries in which they occur. The total probability (or count) in the table is found by summing across all the rows and columns. The marginal distribution of gender would be found by summing across rows, and the marginal distribution of birthdays would be found by summing across the columns.
Based on this, you can inferentially determine other values as indicated by the entries in parentheses below. With one more entry, either for gender or in the marginal row for birthdays, you'd be able to fill in the whole table inferentially. (This is related to the concept of degrees of freedom - how many pieces of info can you fill in independently before the others are determined by the known constraint that the totals are fixed or that probability adds to 1.)
Probabilities (rows = Gender, columns = Birthday; parentheses = inferred entries)
+--------+--------+---------+----------+
| Gender | < 1960 | >= 1960 | Marginal |
+--------+--------+---------+----------+
| F      |        |         |  (0.56)  |
| M      |  0.28  |  (0.16) |   0.44   |
+--------+--------+---------+----------+
| Total  |   ?    |    ?    |   1.00   |
+--------+--------+---------+----------+
Counts (rows = Gender, columns = Birthday; parentheses = inferred entries)
+--------+--------+---------+----------+
| Gender | < 1960 | >= 1960 | Marginal |
+--------+--------+---------+----------+
| F      |        |         |   (14)   |
| M      |   7    |   (4)   |    11    |
+--------+--------+---------+----------+
| Total  |   ?    |    ?    |    25    |
+--------+--------+---------+----------+
Conditional probability corresponds to limiting yourself to the subset of rows or columns specified in the condition. If you had been asked what is the probability of a birthday < 1960 given the gender is male, i.e., P{birthday < 1960 | M} in relatively standard notation, you'd be restricting your focus to just the M row, so the answer would be 7/11 = 0.28/0.44. Computationally, you take the probabilities or counts in the qualifying table entries and express them as a proportion of the probabilities or counts of the specified (given) marginal entries. This is often written in prob & stats texts as P(A|B) = P(AB)/P(B), where AB is a set shorthand for A and B (intersection).
0.44 = 11/25 of the people are male.
0.28 = 7/25 of the people are male and born before 1960.
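As a quick check, the same arithmetic in Python (a throwaway sketch; the variable names are mine):

total = 25
p_male = 0.44
p_male_pre1960 = 0.28          # this proportion is out of the whole group of 25

males = round(total * p_male)                  # 11
males_pre1960 = round(total * p_male_pre1960)  # 7
males_post1960 = males - males_pre1960         # 4
females = total - males                        # 14

# Conditional probability P(born < 1960 | male)
print(p_male_pre1960 / p_male, males_pre1960 / males)  # both ~0.636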

Determine max slope of slowly descending signal

I have an analog power signal from a motor. The signal ramps up quickly, but powers off slowly over the course of several seconds. The signal looks almost like a series of plateaus on the descent. The problem is that the signal doesn't settle back to zero. It settles back to an unknown intermediate level that varies from motor to motor. See the sample data below.
I'm trying to find a way to determine when the motor is off and at that intermediate level.
My thought is to find and store the max point, and calculate the slopes thereafter until the slope is steeper (more negative) than some large negative value like -160 (~ -60 degrees), and declare that the motor must be powering off. The sample points below are with all duplicates removed (there are typically about 5000 samples).
My problem is determining the X values. In the formula (y2 - y1) / (x2 - x1), the x values could be far enough apart in time that the slope never appears greater than -30 degrees. Picking an absolute number like 10 would fix this, but is there a more mathematically correct method?
The data shows me calculating slope with the method described above against the max of 921, i.e. (y2 - y1) / ((10+1) - 10). In this scheme, at data point 9, I would say the motor is "Off". I'm looking for a more precise means of determining an X value rather than arbitrarily picking 10, for instance.
+---+-----+----------+
| X | Y | Slope |
+---+-----+----------+
| 1 | 65 | 856.000 |
| 2 | 58 | 863.000 |
| 3 | 57 | 864.000 |
| 4 | 638 | 283.000 |
| 5 | 921 | 0.000 |
| 6 | 839 | -82.000 |
| 7 | 838 | -83.000 |
| 8 | 811 | -110.000 |
| 9 | 724 | -197.000 |
+---+-----+----------+
EDIT: A much simpler answer:
Since your motor is either ON or OFF, and ON wattages are strictly higher than OFF wattages, you should be able to discriminate between ON and OFF wattages by maintaining an average wattage, reporting ON if the current measurement is higher than the average and OFF if it is lower.
Count = 0
Average = 500
Whenever a measurement comes in,
Count = Count + 1
Average = Average + (Measurement - Average) / Count
Return Measurement > Average ? ON : OFF
This represents an average of all the values the wattage has ever been. If we want to eventually "forget" the earliest values (before the motor was ever turned on), we could either keep a buffer of recent values and use that for a moving average, or approximate a moving average with an IIR like
Average = (1-X) * Average + X * Measurement
for some X between 0 and 1 (closer to 0 to change more slowly).
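A runnable Python sketch of this discriminator (the incremental mean, with the IIR variant left as a comment); the wattage samples reuse the Y values from the question's table and are purely illustrative:

class OnOffDetector:
    def __init__(self, starting_average=500.0):
        self.count = 0
        self.average = starting_average  # overwritten by the first measurement, as in the pseudocode

    def classify(self, measurement):
        # Incremental mean of every measurement seen so far
        self.count += 1
        self.average += (measurement - self.average) / self.count
        # To "forget" old values instead, use an IIR:
        # self.average = (1 - x) * self.average + x * measurement, for some 0 < x < 1
        return "ON" if measurement > self.average else "OFF"

detector = OnOffDetector()
for watts in [65, 58, 57, 638, 921, 839, 838, 811, 724]:
    print(watts, detector.classify(watts))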
Original answer:
You could treat this as an online clustering problem, where you expect three clusters (before the motor turns on, when the motor is on, and when the motor is turned off), or perhaps four (before the motor turns on, peak power, when the motor is running normally, and when the motor turns off). In effect, you're trying to learn what it looks like when a motor is on (or off).
If you don't have any other information about whether the motor is on or off (which could be used to train a model), here's a simple approach:
Define an "Estimate" to contain:
float Value
int Count
Define an "Estimator" to contain:
float TotalError = 0.0
Estimate COLD_OFF = {Value = 0, Count = 1}
Estimate ON = {Value = 1000, Count = 1}
Estimate WARM_OFF = {Value = 500, Count = 1}
a function Update_Estimate(float Measurement)
Find the Estimate E such that E.Value is closest to Measurement
Update TotalError = TotalError + (E.Value - Measurement)*(E.Value - Measurement)
Update E.Value = (E.Value * E.Count + Measurement) / (E.Count + 1)
Update E.Count = E.Count + 1
return E
This takes initial guesses for what the wattages of these stages should be and updates them with the measurements. However, this has some problems. What if our initial guesses are off?
You could initialize some number of Estimators with different possible (e.g. random) guesses for COLD_OFF, ON, and WARM_OFF; after receiving a measurement, let each Estimator update itself and aggregate their values somehow. This aggregation should reward the better estimates. Since you're storing TotalError for each estimate, you could just pick the output of the Estimator that has the lowest TotalError so far, or you could let the Estimators vote (giving each Estimator's vote a weight proportional to 1/(TotalError + 1) or something like that).
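For reference, a compact Python sketch of a single Estimator as described above (the class layout and default guesses are mine; running several Estimators with different guesses and weighting them by TotalError would follow the same pattern):

class Estimator:
    def __init__(self, cold_off=0.0, on=1000.0, warm_off=500.0):
        # Each estimate holds a value and the number of points it has absorbed
        self.estimates = {"COLD_OFF": [cold_off, 1], "ON": [on, 1], "WARM_OFF": [warm_off, 1]}
        self.total_error = 0.0

    def update(self, measurement):
        # Pick the estimate whose value is closest to the measurement
        label = min(self.estimates, key=lambda k: abs(self.estimates[k][0] - measurement))
        value, count = self.estimates[label]
        self.total_error += (value - measurement) ** 2
        self.estimates[label] = [(value * count + measurement) / (count + 1), count + 1]
        return label

est = Estimator()
for watts in [65, 58, 57, 638, 921, 839, 838, 811, 724]:
    print(watts, est.update(watts))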

Find the optimum number of non uniform bins

R - Problem: to find the optimum number of non-uniform bins to show a range of data points.
I have a bunch of data points (let us assume different prices of different mobiles). I need to categorize these mobile phones into some categories (based on the price). The bin size (which in this example refers to the price range) need not be uniform (there might be lots of mobiles in the low price category and few in the long tail category).
Is there any efficient algorithm to find the optimum number of bins required and the number of data points (in this case mobile phones) which should go into each category?
This is not a standard formula, but I wanted to post it as it seems to work well with the data sets I tested.
Find the average price of all the mobiles.
Ex: 5 mobiles with prices 10, 20, 40, 80, 200
Avg is 350/5 = 70
Subtract minimum price from average price: 70 - 10 = 60 -> name it N1
Subtract avg price from Max price: 200 - 70 = 130 -> name it N2
Find the ratio N2/N1 : 130/60: Roughly 2
This indicates that it is better to have 2 bins at the lower price range for every 1 bin at higher range.
So, for example, take 2 bins below 70: Range 0-35 (2 mobiles), 36-70 (1 mobile)
and 1 bin above 70: Range 71-200 (2 mobiles)
As you can see, the number of bins and the bin sizes come out reasonably well.
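A small Python sketch of that heuristic, under the assumption that the bins below the average are equal-width (the edges and names are my own reading of the steps above):

def heuristic_bins(prices):
    avg = sum(prices) / len(prices)
    n1 = avg - min(prices)            # spread below the average
    n2 = max(prices) - avg            # spread above the average
    ratio = max(1, round(n2 / n1))    # e.g. 130/60 -> roughly 2 bins below per bin above
    width = n1 / ratio
    # 'ratio' equal-width bins from the minimum up to the average, plus one bin above it
    edges = [min(prices) + i * width for i in range(ratio + 1)] + [max(prices)]
    return edges

print(heuristic_bins([10, 20, 40, 80, 200]))   # [10.0, 40.0, 70.0, 200]

The exact edges differ slightly from the worked example (which starts the first bin at 0 rather than at the minimum price), but the shape is the same: more, narrower bins at the cheap end and a single wide bin for the long tail.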

Faster computation to get multiples of number at different levels

Here is the scenario:
We have several items that are shipped to many stores. We want to be able to allocate a certain quantity of each item to a store based on need. Each of these stores is also associated to a specific warehouse.
The catch is that at the warehouse level, the total quantity of each item must be a multiple of a number (6 for example).
I have already calculated out the quantity needed by each store at store level, but they do not sum up to a multiple of 6 at the warehouse level.
My solution was this using Excel:
Using a SUMIFS formula to keep track of the sum of each item allocated at the warehouse level. Then another MOD(6) formula that calculates how much remains until a multiple of 6. Then my actual VBA code loops through and subtracts 1 (if MOD <= 3) or adds 1 (if MOD > 3) to the store-level units needed until MOD = 0 for all rows.
Now this works for me, but is extremely slow even when I have just ~5000 rows.
I am looking for a faster solution, because every time I subtract from/add to the units needed, the SUMIFS and MOD need to be calculated again.
EDIT: (trying to be clearer)
I have a template file that I paste my data into with the following setup:
+------+-------+-----------+----------+--------------+--------+
| Item | Store | Warehouse | StoreQty | WarehouseQty | Mod(6) |
+------+-------+-----------+----------+--------------+--------+
| 1 | 1 | 1 | 2 | 8 | 2 |
| 1 | 2 | 1 | 3 | 8 | 2 |
| 1 | 3 | 1 | 1 | 8 | 2 |
| 1 | 4 | 1 | 2 | 8 | 2 |
| 2 | 1 | 2 | 1 | 4 | 2 |
| 2 | 2 | 2 | 3 | 4 | 2 |
+------+-------+-----------+----------+--------------+--------+
Currently the WarehouseQty column is the SUMIFS formula summing up the StoreQty for each Item-Store combo that is associated with the Warehouse. So the Warehouse/WarehouseQty columns are actually duplicated several times, once for every Item-Store combo. The WarehouseQty is the one that needs to be a multiple of 6.
Screen updating can be turned OFF to speed up lengthy computations like this:
Application.ScreenUpdating = FALSE
The opposite assignment turns screen updating back on again.
Put the data into an array first, rather than working on the cells directly, then put the data back after you have manipulated it - this will be much faster.
An example which uses your criteria:
Option Explicit
Sub test()
Dim q() 'this is what will be used for the range
Dim i As Long
q = Range("C2:C41") 'put the data into the array - *ALWAYS* 2 dimensions, even if a single column
For i = LBound(q) To UBound(q) ' use this, in case it's a dynamic array - 1 to 40 would have worked here
Select Case q(i, 1) Mod 6 ' calculate remainder
Case 0 To 3
q(i, 1) = q(i, 1) - (q(i, 1) Mod 6) 'make a multiple of 6
Case 4 To 5
q(i, 1) = q(i, 1) - (q(i, 1) Mod 6) + 6 ' round up to the next multiple of 6
End Select
Next i
Range("D2:D41") = q ' drop the data back
End Sub
I'm guessing you may find that stopping the screen refresh helps quite a bit, and that you therefore won't need any more suggestions.
Another option would be to reduce the adjustment to a quantity divisible by 6 down to a handful of If statements, depending on the value of Mod(6).
You could also address how you sum up the number of a particular item across all stores: using a pivot table and reading the sum totals from there is a lot quicker than using SUMIFS in a macro.
Based on your modifications to the question:
You're correct that you could have huge amounts of replication doing the calculation row by row, as well as adjusting the quantity by a single unit at a time even though you know exactly how many units you need to add / remove from the mod(6) formula.
Could you not create a new sheet with all your possible combinations of product ID and store? You could then use SUMIFS() for each of these unique combinations and, in a final step, round up/down at the warehouse level.
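Outside Excel, the same group-and-adjust idea fits in a few lines; here is a Python sketch of the logic (field names and the round-robin adjustment are my own illustration, not the original macro):

from collections import defaultdict

def adjust_to_multiple_of_6(rows):
    # rows: dicts with item, store, warehouse and store_qty
    groups = defaultdict(list)
    for r in rows:
        groups[(r["item"], r["warehouse"])].append(r)
    for group in groups.values():
        rem = sum(r["store_qty"] for r in group) % 6
        delta = -rem if rem <= 3 else 6 - rem      # subtract if MOD <= 3, otherwise add
        i = 0
        while delta != 0:
            r = group[i % len(group)]
            step = 1 if delta > 0 else -1
            if r["store_qty"] + step >= 0:         # never push a store below zero
                r["store_qty"] += step
                delta -= step
            i += 1
    return rows

rows = [{"item": 1, "store": s, "warehouse": 1, "store_qty": q} for s, q in [(1, 2), (2, 3), (3, 1), (4, 2)]]
print(adjust_to_multiple_of_6(rows))   # the warehouse total goes from 8 down to 6

Because each (item, warehouse) group is summed once and adjusted in memory, there is no repeated SUMIFS/MOD recalculation for every unit moved.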
