I have a task where I have to read a big file and process the data within it. Every row in the file looks like this:
CustomerId ItemId Amount Price
I then need to calculate the total cost for each customer, but first I need to work out the most expensive item purchased. I then have to subtract the most expensive item from the total cost.
My idea is that first I can build this table:
CustomerId ItemId Total_Cost
Then I sort the table, find the highest cost, and store it in a variable.
Then I can make this table:
CustomerId Total_Cost
Then I'll subtract the highest cost from each row.
I feel this is a brute-force approach, and I was wondering whether there is a more clever and efficient way to do it. I also need advice on which library to use; I am confused about which is best for this problem: Spark, Storm, Flume, or Akka-Stream.
You can do this faster by keeping track of the most expensive item purchased by each customer.
Let's assume your data is:
4, 34, 2, 500
4, 21, 1, 700
4, 63, 5, 300
On the first line, customer 4 purchases 2 items at 500 each. You do not add this to the total cost yet, because at this point this purchase is the most expensive.
When line 2 comes, you compare the new purchase against your current most expensive. If it is more expensive, replace the most expensive and add the previous most expensive to the total cost; if less, add the new purchase to the total cost.
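A minimal single-pass sketch of this idea in Python (the space-separated layout follows the question; the file name is a placeholder):

from collections import defaultdict

totals = defaultdict(float)        # running cost per customer, excluding the current maximum
max_purchase = defaultdict(float)  # most expensive single purchase seen so far per customer

with open("purchases.txt") as f:   # hypothetical file name
    for line in f:
        customer_id, item_id, amount, price = line.split()
        cost = int(amount) * float(price)
        if cost > max_purchase[customer_id]:
            # the previous maximum is no longer the most expensive, so count it now
            totals[customer_id] += max_purchase[customer_id]
            max_purchase[customer_id] = cost
        else:
            totals[customer_id] += cost

# totals[c] is now customer c's total cost minus their most expensive purchase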
There is an inventory of products, e.g. A: 10 units, B: 15 units, C: 20 units, and so on. We have some customer orders for some of the products, e.g. customer1 {A: 10 units, B: 15 units}, customer2 {A: 5 units, B: 10 units}, customer3 {A: 5 units, B: 5 units}. The task is to fulfill the maximum number of customer orders with the limited inventory we have. The result in this case should be to fill the orders of customer2 and customer3 instead of just customer1. [The background for this problem is a real-time online retail scenario, where we have millions of customers and millions of products and we are trying to fulfill the orders as efficiently as possible.]
How do I solve this? Is there an algorithm for this kind of problem, something like optimisation?
Edit: The requirement here is fixed. The only aim is to maximize the number of fulfilled orders, regardless of value. But we have millions of users and millions of products.
This problem includes the knapsack problem as a special case. To see why, consider only one product, A: the stock of the product is your bag capacity, the order quantities are the item weights, and each item's value is 1. Your problem is to maximize the total value you can fit in the bag.
Don't expect an exact solution for your problem in polynomial time...
An approach I'd go for is a random search: make a list of the orders and compute a solution (i.e. complete the orders in sequence, skipping the orders you cannot fulfill). Then change the solution by applying a permutation to the orders and see if it is better.
Keep searching until time runs out or you're happy with the solution.
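A minimal sketch of that random search in Python (the orders-as-dicts layout and the two-element swap move are assumptions; fulfilled() greedily completes orders in sequence, skipping those that don't fit):

import random

def fulfilled(sequence, inventory, orders):
    # Greedily complete orders in the given sequence, skipping those that don't fit;
    # returns the number of orders fulfilled.
    stock = dict(inventory)
    count = 0
    for order_id in sequence:
        need = orders[order_id]
        if all(stock[p] >= q for p, q in need.items()):
            for p, q in need.items():
                stock[p] -= q
            count += 1
    return count

def random_search(inventory, orders, iterations=10000):
    # orders: {order_id: {product: quantity}}; inventory: {product: quantity}
    best = list(orders)
    best_score = fulfilled(best, inventory, orders)
    for _ in range(iterations):
        candidate = best[:]
        i, j = random.sample(range(len(candidate)), 2)
        candidate[i], candidate[j] = candidate[j], candidate[i]  # permute two orders
        score = fulfilled(candidate, inventory, orders)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score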
It can be solved with dynamic programming (DP).
Let DP[n][m][o] be the maximum number of orders that can be fulfilled using at most n units of A, m units of B, and o units of C.
Do a bottom-up computation:
Set DP[i][j][k] = 0, for all i = 0 to Amax; j = 0 to Bmax; k = 0 to Cmax
For each order requiring quantities (a, b, c):
    For each n : Amax down to a
        For each m : Bmax down to b
            For each o : Cmax down to c
                DP[n][m][o] = max(DP[n][m][o], DP[n-a][m-b][o-c] + 1)
Iterating the inventory downward ensures each order is counted at most once. The answer is DP[Amax][Bmax][Cmax]. The running time is O(Amax * Bmax * Cmax * number of orders), so this is only practical when the inventory quantities are small.
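A sketch of this DP in Python for three products (variable names are assumptions; time and memory grow with Amax * Bmax * Cmax, so this only scales to small inventories):

def max_fulfilled(orders, Amax, Bmax, Cmax):
    # orders: list of (a, b, c) quantities each order needs of products A, B, C
    # DP[n][m][o] = max number of orders fulfillable with n units of A, m of B, o of C
    DP = [[[0] * (Cmax + 1) for _ in range(Bmax + 1)] for _ in range(Amax + 1)]
    for a, b, c in orders:
        # iterate downward so each order is used at most once (0/1 knapsack style)
        for n in range(Amax, a - 1, -1):
            for m in range(Bmax, b - 1, -1):
                for o in range(Cmax, c - 1, -1):
                    DP[n][m][o] = max(DP[n][m][o], DP[n - a][m - b][o - c] + 1)
    return DP[Amax][Bmax][Cmax]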
Reams have been written about order fulfillment, and yet no one has come up with a standard answer. The reason is that companies have different approaches and different requirements.
There are so many variables that a one-size-fits-all solution is not possible.
You would have to sit down and ask hundreds of questions before you could even start to come up with an approach tailored to your customer's needs.
Indeed, those needs might vary based on the time of year, the day of the week, what promotions are currently being run, whether customers are ranked, the number of picking and packing staff/machinery currently employed, the nature, size, and weight of products, where products are located in the warehouse, and whether certain products are in fast/automated picking lines, standard picking faces, or bulk storage. The list can appear endless.
Then consider whether all orders are to be filled in full, or whether you are allowed to partially fill an order and back-order out-of-stock products.
Does the entire order have to fit in a single box, or are multiple-box orders permitted?
Are you dealing with multiple warehouses? If so, can partial orders be sent from each, or do they have to be transferred for consolidation?
Should precedence be given to local or overseas orders?
The amount of information that you need at your fingertips before you can even start to plan a methodology to fit your customer's specific requirements can be enormous, and sadly, you are not going to get a definitive answer. It does not exist.
Whilst I realise that this is not (a) an answer or (b) necessarily a welcome post, the hard truth is that you will require your customer to provide immense detail as to what it is they wish to achieve, how, and when.
Your job, initially, is to play devil's advocate in attempting to nail them down.
P.S. Welcome to S.O.
I have transactional data which contains customer information as well as the stores they shopped from. I can count the number of different stores each customer used with a simple DISTINCTCOUNT([Site Name]) measure.
There are millions of customers, and I want to make a simple summary table which shows the number of customers who visited X stores, like a histogram. The maximum number of stores they visited is 6; the minimum is 1.
I know there are multiple ways to do this, but I am new to DAX and can't yet express what I have in mind.
The easiest way:
Assuming your DISTINCTCOUNT([Site Name]) measure is called CustomerStoreCount ...
Add a new dimension table, StoreCount, to your model, containing a single column, StoreCount. Populate it with the values 1, 2, 3, 4, 5, 6 (... up to the maximum number of stores).
Create a measure, ThisStoreCount = MAX(StoreCount[StoreCount]).
Create a base customer count measure, TotalCustomers:=DISTINCTCOUNT(CustomerTable[Customer])
Create a contextual measure, CustomersWhoVisitedXNumberOfStores := CALCULATE ( TotalCustomers, FILTER(VALUES(CustomerTable[Customer]), ThisStoreCount = CustomerStoreCount) )
In your pivot table / reporting tool, use StoreCount[StoreCount] on the axis and CustomersWhoVisitedXNumberOfStores as the measure.
So basically, you walk through the customer list (since there is no relationship between StoreCount and CustomerTable) and compare each customer's CustomerStoreCount with the maximum StoreCount[StoreCount] value, which for each StoreCount[StoreCount] value is... drum roll... itself. If they match, keep the customer; otherwise, filter it out. You end up with a count of customers whose number of store visits equals the value of StoreCount[StoreCount].
And of course the more general modeling hint: when you want to display a metric by something (e.g. customer count by number of stores visited), that something is an attribute, not a metric.
This question (the last one) appeared in the Benelux Algorithm Programming Contest 2007:
http://www.cs.duke.edu/courses/cps149s/spring08/problems/bapc07/allprobs.pdf
Problem statement in short:
A company needs to figure out a strategy for when to buy, sell, or do nothing on a given input so as to maximise profit. The input is in the form:
6
4 4 2
2 9 3
....
....
It means input is given for 6 days.
Day 1: You get 4 shares, each with price $4, and you can sell at most 2 of them.
Day 2: You get 2 shares, each with price $9, and you can sell at most 3 of them.
.
We need to output the maximum profit that can be achieved.
I'm thinking about how to approach this problem. It seems to me that brute force will take too much time. Can this be converted to a DP problem, like 0-1 knapsack? Any help will be highly appreciated.
It can be solved with DP.
Suppose there are n days, and the total number of shares received over all days is m.
Let f[i][j] mean: at the end of the i-th day, with j shares remaining, the maximum profit is f[i][j].
Then f[i][j] = maximum(f[i-1][j + k - shares_received[i]] + k * price_per_day[i]) over 0 <= k <= maximum_shares_sell_per_day[i], since on day i you first receive shares_received[i] shares and then sell k of them.
It can be further optimized: since f[i][...] only depends on f[i-1][...], a rolling array can be used, so you only need f[2][m] to save space.
The total time complexity is O(n * m * maximum_shares_sell_per_day).
Perhaps it can be further optimized to save time. Any feedback is welcome.
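A sketch of this DP in Python, without the rolling-array trick (the (received, price, sell_limit) input layout follows the problem statement):

def max_profit(days):
    # days: list of (received, price, sell_limit) tuples, in chronological order
    m = sum(r for r, _, _ in days)   # upper bound on shares in hand
    NEG = float("-inf")
    f = [NEG] * (m + 1)              # f[j] = best profit with j shares in hand
    f[0] = 0
    for received, price, limit in days:
        g = [NEG] * (m + 1)
        for j_prev in range(m + 1):
            if f[j_prev] == NEG:
                continue             # unreachable state
            have = j_prev + received                 # shares after the day's delivery
            for k in range(min(limit, have) + 1):    # sell k shares today
                j = have - k
                if j <= m:
                    g[j] = max(g[j], f[j_prev] + k * price)
        f = g
    return max(f)                    # leftover shares are worthless at the end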
Your description does not quite match the last problem in the PDF - in the PDF you receive the number of shares specified in the first column (or are forced to buy them; since there is no decision to make, it does not matter) and can only decide how many shares to sell. Since it does not say otherwise, I presume that short selling is not allowed (otherwise, ignore everything except the price and go make so much money on the derivatives market that you can afford to bribe the SEC or Congress and retire :-)).
This looks like a dynamic program where the state at each point in time is the total number of shares you have in hand. So at time n you have an array with one element for each possible number of shares you might have ended up with at that time, and in that element you have the maximum amount of money you can make up to then while ending up with that number of shares. From this you can work out the same information for time n+1. When you reach the end, all your shares are worthless, so the best answer is the one associated with the maximum amount of money.
We can't do better than selling the maximum number of shares we can on the day with the highest price, so I was thinking the following (this may be somewhat difficult to implement efficiently):
It may be a good idea to precompute, for each day, the total number of shares received so far, to improve the efficiency of the algorithm.
Process the days in decreasing order of price.
For a day, sell amount = min(daily sell limit, shares available); for the max-price day (the first processed day), shares available = shares received to date.
For all subsequent days, shares available -= sell amount. For preceding days, we binary search for (shares available - shares sold), and all entries between that day and the day just processed become 0.
We might not need to physically set the values (at least not at every step); we could calculate them on the fly from the history so far (I'm thinking of an interval tree or something similar).
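A simple O(n^2) sketch of this greedy in Python, skipping the binary-search/interval-tree bookkeeping (an assumption-level restatement: slack[e] is the number of shares received but not yet sold in days 0..e, and a sale on day d is limited by every prefix ending on or after d):

def max_profit_greedy(days):
    # days: list of (received, price, sell_limit) tuples, in chronological order
    n = len(days)
    slack = []        # slack[e] = shares received minus shares sold in days 0..e
    total = 0
    for received, _, _ in days:
        total += received
        slack.append(total)
    profit = 0
    for d in sorted(range(n), key=lambda i: -days[i][1]):   # highest price first
        sell = min(days[d][2], min(slack[d:]))  # limited by every prefix ending on/after d
        profit += sell * days[d][1]
        for e in range(d, n):                   # selling on day d consumes slack from d on
            slack[e] -= sell
    return profit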
Sorry if the title is confusing; I'll just try to describe what I want to achieve.
I want to optimize my database design for handling deliveries and ending inventory. Deliveries are made at any time of the week and are grouped by week number; orders can be placed at any time of day. Order quantities are then subtracted from the total deliveries per week to get the ending inventory. What is the best database design and programming approach for this?
What I have:
Deliveries table with quantity, weekNo, weekYr
Orders table with quantity, weekNo, weekYr
Every time I want to get the ending inventory, I group the data by weekYr and weekNo and subtract the total Orders quantity from the total Deliveries quantity. But my problem is that the ending inventory needs to be carried over to the next week. What's the best and most efficient way to do it?
Your current approach seems sound to me, so you might clarify what the actual problem is. Your last sentence is confusing: does the product spoil at the end of the week? It's not clear why you would need to group by week at all. If you receive 100 products via delivery and sell 10 products per week for the next three weeks, you have 70 products left.
My best guess is you have a case where there are other factors to consider besides the simple math of what was received minus what was sold. Perhaps you lose inventory due to spoilage (maybe you sell some sort of food) or shrinkage (maybe you sell retail goods that get stolen). One solution would be to have a separate table called "shrinkage" or "spoilage" that also gets subtracted out of deliveries to arrive at your actual inventory. Of course, this table will need to be updated as product is removed from the shelves due to spoilage, or when the shrinkage is realized.
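For the carry-over itself, a running balance is enough. A minimal sketch in Python (the weekly totals are assumed to be already aggregated from the Deliveries and Orders tables):

def ending_inventory(weeks):
    # weeks: list of (delivered, ordered) weekly totals in chronological order
    balance = 0
    result = []
    for delivered, ordered in weeks:
        balance += delivered - ordered   # carry last week's ending inventory forward
        result.append(balance)
    return result

# Example from above: 100 delivered in week 1, then 10 sold per week
print(ending_inventory([(100, 10), (0, 10), (0, 10)]))  # [90, 80, 70]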
Stock allocation problem.
I have a problem where each of a known set of products, with various rates of sale, needs to be allocated into one or more of a fixed number of buckets.
Each product must be in at least one bucket, and buckets cannot share a product.
All buckets must be filled, and products will usually be in more than one bucket.
My problem is to optimize the allocation of products to buckets so as to maximise the amount of time before any one product sells out.
To complicate matters, each type of bucket may hold differing amounts of each type of product.
This is not necessarily related to the size of the product (which is not known) but may be arbitrary.
E.g., Bucket A holds 10 of Product 1 and Bucket B holds 20 of Product 2; however,
Bucket A holds only 5 of Product 2 and Bucket B holds only 8 of Product 1.
So, as inputs we have a set of products and their sales velocities, e.g.:
Product 1 Sells 6 per day
Product 2 Sells 5 per day
Product 3 Sells 4 per day
Product 4 Sells 7 per day
A set of buckets:
Bucket A
Bucket B
Bucket C
Bucket D
Bucket E
Bucket F
Bucket G
And a product-bucket lookup table to determine each bucket's capacity for each product, e.g.:
Prod 1 Bucket A = 40;
Prod 1 Bucket B = 45;
Prod 1 Bucket C = 40;
...
Prod 2 Bucket A = 35;
...
Prod 2 Bucket E = 20;
...
etc
Approaches I have tried so far include:
Reducing the products per bucket to a common factor - until I realised the product-bucket size relationship was arbitrary.
Placing products into buckets at random, then iterating through the products, swapping each for an existing product in a bucket and testing whether the swap improves the time taken until the first sell-out.
My concern with this approach is that it may take a path that is optimal at decision time but obscures a later, more optimal choice;
or perhaps the optimal allocation requires multiple product changes that will never occur because the individual changes are not improvements.
An exhaustive search - it turns out this produces a very large number of combinations even for a modest set of products and buckets.
I initially thought the optimum solution would be to allocate products in the same ratio as their sale rates, but discovered this is not true: a configuration holding a very small number of products that matches the sales ratios perfectly would be less desirable than a configuration holding much more stock, since the latter lasts longer before the first sell-out.
Any C# or pseudo-code is appreciated.
I suggest a variant of approach 2 based on simulated annealing -- a great approach to optimization when your underlying strategy is based on steepest descent or the like. Wikipedia does a good job of explaining the idea; the crucial conceptual part is:
each step of the SA algorithm replaces the current solution by a random "nearby" solution, chosen with a probability that depends on the difference between the corresponding function values and on a global parameter T (called the temperature), that is gradually decreased during the process
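A generic simulated-annealing sketch in Python (the neighbour move, the scoring function, and the cooling schedule are all assumptions to be tuned; for this problem, score(solution) would return the time until the first product sells out):

import math
import random

def anneal(initial, neighbour, score, t_start=100.0, t_end=0.01, cooling=0.995):
    # Generic simulated annealing; maximizes score.
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    t = t_start
    while t > t_end:
        candidate = neighbour(current)   # e.g. swap the products of two buckets
        cand_score = score(candidate)
        delta = cand_score - current_score
        # always accept improvements; accept worse moves with probability e^(delta/t)
        if delta > 0 or random.random() < math.exp(delta / t):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
        t *= cooling                     # gradually lower the temperature
    return best, best_score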
I think this problem may be NP-complete, and you may have to resort to the usual methods (GA/SA, breadth/depth-first searches) and/or settle for non-optimal solutions, depending on how many buckets and products you have.
Assuming that you have enough product to fill all your buckets (which you don't say), you may be able to brute-force a single product for every bucket to determine which product is best for each bucket. I somehow doubt that this is the case, but in case it is, here is the general algorithm.
(Python sketch: find_capacity and find_maximum_longevity are placeholders for lookups you will need to supply.)
from collections import defaultdict

index = defaultdict(dict)  # bucket -> {product: longevity in days}
for bucket in buckets:
    for product in products:
        capacity = find_capacity(bucket, product)  # units of this product the bucket holds
        sell_rate = 1.0 / sales_velocity[product]  # days per unit sold
        longevity = capacity * sell_rate           # days until this bucket sells out
        index[bucket][product] = longevity

for bucket in buckets:
    product = find_maximum_longevity(index, bucket)  # product that lasts longest here
    print(bucket, product)
Simulated annealing sounds good, although you have to be careful choosing the parameters and the mutation functions to get a good solution.
You could also specify the problem as a series of equations and call an integer programming (IP) package, such as one of those at http://www.coin-or.org/, to find an optimal or near-optimal solution.
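As an illustration, a hypothetical formulation using the PuLP modelling library in Python: maximize the time t until the first sell-out, where a binary variable x[b, p] means bucket b is filled with product p (capacity, velocity, buckets, and products are assumed inputs; "one product per bucket" follows the problem's no-sharing rule):

import pulp

def allocate(buckets, products, capacity, velocity):
    # capacity[b][p]: units of product p that fit in bucket b
    # velocity[p]: units of product p sold per day
    prob = pulp.LpProblem("bucket_allocation", pulp.LpMaximize)
    x = {(b, p): pulp.LpVariable(f"x_{b}_{p}", cat="Binary")
         for b in buckets for p in products}
    t = pulp.LpVariable("t", lowBound=0)  # days until the first sell-out
    prob += t                             # objective: maximize t
    for b in buckets:
        # every bucket is filled with exactly one product (buckets cannot share)
        prob += pulp.lpSum(x[b, p] for p in products) == 1
    for p in products:
        # each product appears in at least one bucket...
        prob += pulp.lpSum(x[b, p] for b in buckets) >= 1
        # ...and its total allocated stock must outlast t days of sales
        prob += pulp.lpSum(capacity[b][p] * x[b, p] for b in buckets) >= velocity[p] * t
    prob.solve()
    assignment = {b: next(p for p in products if x[b, p].value() > 0.5) for b in buckets}
    return assignment, t.value()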