Approximating Price - algorithm

I have a set of products. Each product is a variation of a non existent “parent”. Also, each product (let’s call them child products) has its own individually assigned price in our database. Here is a small example set.
Parent SKU is 1000.
Product Children are:
1000-TankTop-SM - 14.95
1000-TankTop-2X - 17.95
1000-Hoodie-SM - 34.95
1000-Hoodie-2X - 39.95
Here is the problem. Our database lists each real child product price (as directly above) in a one-to-one relationship: each product has a SKU and I can look up the price of each product by SKU. My website cannot support this method of pricing. The way pricing works there is this: I create a “parent” product, and each parent product must have a base price. The prices of the variations are created by adding or subtracting a dollar amount. So a “parent” has two attribute sets, product type and size, and a plus or minus amount must be associated with each attribute. So from my example above we have:
Sizes:
SM +- ?
2X +- ?
Product Types:
TankTop +- ?
Hoodie +- ?
How can I decide what the variables above should equal to at least approximate the actual child product prices? Is this possible without any extreme outliers?

This sounds like a frustrating (ie: crummy) database system, since it's effectively impossible to create certain arbitrary prices. ie:
TankTop = + $2.00
Shirt = + $1.00
Sweat = + $5.00
Small = - $1.00
Medium = + $0.00
Large = + $3.00
X-Large = + $5.00
With the above example, it would be impossible to have a Small Shirt cost $10.00 while simultaneously having a Medium Shirt cost $10.50.
So, each product has a price defined as a sum of: BASE_SKU_PRICE + SIZE_MODIFIER + STYLE_MODIFIER. This means that you cannot assign an arbitrary price value to each unique item, so you'll need to use a regression model.
If you want to re-adjust the prices for a massive table of items, the easiest approach to minimize outliers would be a multivariate least mean squares (LMS) approximation, which is just another flavour of multivariate linear regression.
This will allow you to model each unique item (ie: SKU) as a function of:
y = a + bX_1 + cX_2
If you want a very tidy approach to handling this for a production database system, you would be best off just using MATLAB or SPSS to create your database table, as you can specify confidence intervals, and other parameters to help optimize your approximation.
Finally, I found an example online which you could try out in OpenOffice Calc or Microsoft Excel. This will give you a working algorithmic approach rather than having to derive the analytical equations and generate code from them yourself. It might even be enough to solve your problem without having to break out MATLAB or SPSS.
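To make the regression concrete, here is a minimal sketch in Python/NumPy of fitting a base price plus per-attribute modifiers by least squares (the price dictionary and variable names are illustrative, not from the asker's database):

import numpy as np

prices = {
    ("TankTop", "SM"): 14.95,
    ("TankTop", "2X"): 17.95,
    ("Hoodie",  "SM"): 34.95,
    ("Hoodie",  "2X"): 39.95,
}

types = sorted({t for t, _ in prices})
sizes = sorted({s for _, s in prices})

# Design matrix: one column for the base price, one 0/1 column per type and size.
rows, y = [], []
for (t, s), price in prices.items():
    row = [1.0]
    row += [1.0 if t == tt else 0.0 for tt in types]
    row += [1.0 if s == ss else 0.0 for ss in sizes]
    rows.append(row)
    y.append(price)

coef, *_ = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)

base = coef[0]
type_mod = dict(zip(types, coef[1:1 + len(types)]))
size_mod = dict(zip(sizes, coef[1 + len(types):]))

for (t, s), price in prices.items():
    approx = base + type_mod[t] + size_mod[s]
    print(f"{t}-{s}: actual {price:.2f}, approx {approx:.2f}")

Because the model is purely additive, the fit cannot be exact when, say, the SM-to-2X gap differs between tank tops and hoodies; the solver simply returns the base price and modifiers with the smallest total squared error.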

Related

Fulfilling maximum customer orders

There is an inventory of products, e.g. A - 10 units, B - 15 units, C - 20 units, and so on. We have customer orders for some of these products, e.g. customer1 {A - 10 units, B - 15 units}, customer2 {A - 5 units, B - 10 units}, customer3 {A - 5 units, B - 5 units}. The task is to fulfill the maximum number of customer orders with the limited inventory we have. The result in this case should be to fill customer2's and customer3's orders instead of just customer1's. [The background for this problem is a real-time online retail scenario, where we have millions of customers and millions of products and we are trying to fulfill the orders as efficiently as possible.]
How do I solve this? Is there an algorithm for this kind of problem, something like optimisation?
Edit: The requirement here is fixed. The only aim here is maximizing the number of fulfilled orders regardless of value. But we have millions of users and millions of products.
This problem includes the knapsack problem as a special case. To see why, consider only one product A: the stock of the product is your bag capacity, the order quantities are the item weights, and each item's value is 1. Your problem is to maximize the total value you can fit in the bag.
Don't expect an exact solution for your problem in polynomial time...
An approach I'd go for is a random search: make a list of the orders and compute a solution (i.e. complete orders in sequence, skipping the orders you cannot fulfill). Then change the solution by applying a permutation to the orders and see if it's better.
Keep going with search until time runs out or you're happy with the solution.
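A rough sketch of that random search in Python, with made-up data structures (inventory as a product-to-units dict, each order as a dict of the products it needs; it assumes there are at least two orders):

import random

def fulfilled(order_sequence, inventory):
    """Greedily fulfill orders in the given sequence; return how many fit."""
    stock = dict(inventory)
    count = 0
    for order in order_sequence:
        if all(stock.get(p, 0) >= q for p, q in order.items()):
            for p, q in order.items():
                stock[p] -= q
            count += 1
    return count

def random_search(orders, inventory, iterations=10000):
    best = list(orders)
    best_score = fulfilled(best, inventory)
    current = list(best)
    for _ in range(iterations):
        candidate = list(current)
        i, j = random.sample(range(len(candidate)), 2)   # swap two orders
        candidate[i], candidate[j] = candidate[j], candidate[i]
        score = fulfilled(candidate, inventory)
        if score >= best_score:            # keep equal-or-better permutations
            best, best_score = candidate, score
            current = candidate
    return best_score

inventory = {"A": 10, "B": 15, "C": 20}
orders = [{"A": 10, "B": 15}, {"A": 5, "B": 10}, {"A": 5, "B": 5}]
print(random_search(orders, inventory))   # finds 2 (customer2 and customer3)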
It can be solved by DP.
Firstly sort all your orders with respect to A in increasing order.
Use this DP:
DP[n][m][o] = DP[n-a][m-b][o-c] + 1, where n-a >= 0, m-b >= 0 and o-c >= 0
DP[0][0][0] = 1
Do a bottom-up computation:
Set DP[i][j][k] = 0 for all i = 0 to Amax, j = 0 to Bmax, k = 0 to Cmax
For each n : 0 to Amax
  For each m : 0 to Bmax
    For each o : 0 to Cmax
      if (n >= a && m >= b && o >= c)
        DP[n][m][o] = DP[n-a][m-b][o-c] + 1
You will then have to find the maximum value of DP[i][j][k] over all possible values of i, j, k. This is your answer. - O(n^3)
Reams have been written about order fulfillment and yet no one has come up with a standard answer. The reason being that companies have different approaches and different requirements.
There are so many variables that a one size solution that fits all is not possible.
You would have to sit down and ask hundreds of questions before you could even start to come up with an approach tailored to your customer's needs.
Indeed those needs might also vary, based on the time of year, the day of the week, what promotions are currently being run, whether customers are ranked, the number of picking and packing staff/machinery currently employed, the nature, size and weight of products, where products are in the warehouse, and whether certain products are in fast/automated picking lines, standard picking faces or in bulk. The list can appear endless.
Then consider whether all orders are to be filled or whether you are allowed to partially fill an order and back-order out-of-stock products.
Does the entire order have to fit in a single box or are multiple-box orders permitted?
Are you dealing with multiple warehouses and, if so, can partial orders be sent from each or do they have to be transferred for consolidation?
Should precedence be given to local or overseas orders?
The amount of information that you need at your fingertips before you can even start to plan a methodology to fit your customer's specific requirements can be enormous and sadly, you are not going to get a definitive answer. It does not exist.
Whilst I realise that this is not a) an answer or b) necessarily a welcome post, the hard truth is that you will require your customer to provide you with immense detail as to what it is that they wish to achieve, how and when.
Your job, initially, is to play devil's advocate in attempting to nail them down.
P.S. Welcome to S.O.

Pure logic: how to get number of shares, knowing part and price of the share

My question is not about programming languages but definitely about programming.
I have a model portfolio with shares:
Part      Code   Price, $   Number of shares in portfolio
23.80%    CSIQ   24.91      ?
18.90%    TSL    10.52      ?
11.20%    JKS    24.40      ?
10.70%    YGE     2.90      ?
35.40%    DQ     26.05      ?
I need to calculate the minimum number of shares that should be in the portfolio so that the part of each share in the portfolio equals its part in the model portfolio.
Just imagine that you want to purchase such a portfolio in the real world. How many of each stock should you buy to get the desired part (which is shown in the model portfolio)? I can't buy a non-integer number of shares, and each part in the recalculated (after purchase) portfolio should equal the part in the model portfolio.
Example: I need to get portfolio with 50.0% in Google ($500 per share) and 50.0% in Apple ($700 per share). Solution is 5 shares of Apple (total value $3500) and 7 shares of Google (total value $3500).
Let us expand on the approach devised in the comments.
The first step is to choose a share to be a reference point; this can be any, so we'll go with the first one, CSIQ. Let us say then that we will purchase one share of this, so we now know that 23.8% of the portfolio is worth $24.91.
For the second share, this is now the problem we have:
Part      Code   Price, $   Number of shares in portfolio
23.80%    CSIQ   24.91      1
18.90%    TSL    10.52      ?
Since we know the value of a fraction of the portfolio, let us work out what the whole portfolio would be:
total_value = (100 / 23.8) * 24.91
= $104.663865546
That means the amount we can spend on TSL is:
tsl_value = 104.663865546 * (18.9/100)
          = $19.781470588
We know how much a TSL share costs, so to hit the ratio exactly we would need to buy a non-integer amount of this share:
share_amount = 19.781470588/10.52
             = 1.88036793
You can then go through each share in the same way, and end up with a portfolio in the desired ratios.
If you already own a number of shares in one stock, you can modify the algorithm but instead of starting with 1 share, you start with X shares - multiply everything by X and it will still work.
After you added the constraint that shares can only be purchased in integer amounts, I would suggest that you use the X multiplier approach above, coupled with rounding share amounts to the closest integer. As you increase X exponentially (10, 100, etc) your level of inaccuracy due to rounding will get progressively smaller.
As I suggested in the comments, you could build this in a spreadsheet first and determine the level of inaccuracy for inputs of X. Of course, if you plan to actually buy these shares, X is constrained by the amount of money you have; conversely if it is theoretical you can make it 6 or 7 figures and achieve good levels of accuracy.
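A small sketch of that approach in Python, using the figures from the question (the helper name and the choice of the first stock as the reference are just for illustration):

targets = [          # (code, part, price)
    ("CSIQ", 0.2380, 24.91),
    ("TSL",  0.1890, 10.52),
    ("JKS",  0.1120, 24.40),
    ("YGE",  0.1070,  2.90),
    ("DQ",   0.3540, 26.05),
]

def portfolio_for(x):
    """Buy x reference shares of the first stock, scale the rest and round."""
    ref_part, ref_price = targets[0][1], targets[0][2]
    total_value = x * ref_price / ref_part
    shares = []
    for code, part, price in targets:
        n = round(total_value * part / price)
        shares.append((code, n))
    return shares

for x in (1, 10, 100, 1000):
    shares = portfolio_for(x)
    actual_total = sum(n * price for (c, n), (_, _, price) in zip(shares, targets))
    worst = max(abs(n * price / actual_total - part)
                for (c, n), (_, part, price) in zip(shares, targets))
    print(x, shares, f"worst part error: {worst:.4%}")

As the multiplier x grows, the worst deviation from the model portfolio's parts shrinks, which is exactly the trade-off described above.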

Which algorithm/implementation for weighted similarity between users by their selected, distanced attributes?

Data Structure:
User has many Profiles
(Limit - no more than one of each profile type per user, no duplicates)
Profiles has many Attribute Values
(A user can have as many or few attribute values as they like)
Attributes belong to a category
(No overlap. This controls which attribute values a profile can have)
Example/Context:
I believe with stack exchange you can have many profiles for one user, as they differ per exchange site? In this problem:
Profile: Video, so Video profile only contains Attributes of Video category
Attributes, so an Attribute in the Video category may be Genre
Attribute Values, e.g. Comedy, Action, Thriller are all Attribute Values
Profiles and Attributes are just ways of grouping Attribute Values on two levels.
Without grouping (which is needed for weighting in 2. onwards), the relationship is just User hasMany Attribute Values.
Problem:
Give each user a similarity rating against each other user.
Similarity based on All Attribute Values associated with the user.
Flat/one level
Unequal number of attribute values between two users
Attribute value can only be selected once per user, so no duplicates
Therefore, binary string/boolean array with Cosine Similarity?
1 + Weight Profiles
Give each profile a weight (totaling 1?)
Work out profile similarity, then multiply by weight, and sum?
1 + Weight Attribute Categories and Profiles
As an attribute belongs to a category, categories can be weighted
Similarity per category, weighted sum, then same by profile?
Or merge profile and category weights
3 + Distance between every attribute value
Table of similarity distance for every possible value vs value
Rather than similarity by value === value
'Close' attributes contribute to overall similarity.
No idea how to do this one
Fancy code and useful functions are great, but I'm really looking to fully understand how to achieve these tasks, so I think generic pseudocode is best.
Thanks!
First of all, you should remember that everything should be made as simple as possible, but not simpler. This rule applies to many areas, but in things like semantics, similarity and machine learning it is essential. Using several layers of abstraction (attributes -> categories -> profiles -> users) makes your model harder to understand and to reason about, so I would try to omit it as much as possible. This means that it's highly preferable to keep direct relation between users and attributes. So, basically your users should be represented as vectors, where each variable (vector element) represents single attribute.
If you choose such representation, make sure all attributes make sense and have appropriate type in this context. For example, you can represent 5 video genres as 5 distinct variables, but not as numbers from 1 to 5, since cosine similarity (and most other algos) will treat them incorrectly (e.g. multiply thriller, represented as 2, with comedy, represented as 5, which makes no sense actually).
It's ok to use distance between attributes when applicable, though I can hardly come up with an example in your setting.
At this point you should stop reading and try it out: simple representation of users as vector of attributes and cosine similarity. If it works well, leave it as is - overcomplicating a model is never good.
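As a concrete starting point, here is a tiny Python sketch of that baseline: each user becomes a binary vector over the full set of attribute values, and similarity is the cosine between the vectors (the attribute list and users are invented):

import math

attributes = ["comedy", "action", "thriller", "rock", "jazz"]

def to_vector(user_attrs):
    """Binary vector: 1 if the user selected the attribute, else 0."""
    return [1.0 if a in user_attrs else 0.0 for a in attributes]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

alice = to_vector({"comedy", "thriller", "jazz"})
bob   = to_vector({"comedy", "action", "jazz"})
print(cosine(alice, bob))   # 2 shared attributes out of 3 each -> ~0.667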
And if the model performs badly, try to understand why. Do you have enough relevant attributes? Or are there too many noisy variables that only make it worse? Or should some attributes really have larger importance than others? Depending on these questions, you may want to:
Run feature selection to avoid noisy variables.
Transform your variables, representing them in some other "coordinate system". For example, instead of using N variables for N video genres, you may use M other variables to represent closeness to specific social group. Say, 1 for "comedy" variable becomes 0.8 for "children" variable, 0.6 for "housewife" and 0.9 for "old_people". Or anything else. Any kind of translation that seems more "correct" is ok.
Use weights. Not weights for categories or profiles, but weights for distinct attributes. But don't set these weights yourself, instead run linear regression to find them out.
Let me describe the last point in a bit more detail. Instead of simple cosine similarity, which looks like this:
cos(x, y) = x[0]*y[0] + x[1]*y[1] + ... + x[n]*y[n]
you may use weighted version:
cos(x, y) = w[0]*x[0]*y[0] + w[1]*x[1]*y[1] + ... + w[n]*x[n]*y[n]
The standard way to find such weights is to use some kind of regression (linear regression being the most popular). Normally, you collect a dataset (X, y) where X is a matrix with your data vectors as rows (e.g. details of houses being sold) and y is some kind of "correct answer" (e.g. the actual price each house was sold for). In your case, however, there's no correct answer for a single user vector; you can only define a correct answer for the similarity between two users. So do exactly that: make each row of X a combination of 2 user vectors, and the corresponding element of y the similarity between them (which you assign yourself for a training dataset). E.g.:
X[k] = [ user_i[0]*user_j[0], user_i[1]*user_j[1], ..., user_i[n]*user_j[n] ]
y[k] = .75 // or whatever you assign to it
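A sketch of both ideas in Python/NumPy: a weighted similarity function, and learning the weights w by ordinary least squares from hand-labelled pairs (all data here is invented for illustration):

import numpy as np

def weighted_similarity(w, x, y):
    return float(np.dot(w, x * y))

# Training pairs: (user_i, user_j, similarity you judged by hand)
pairs = [
    (np.array([1., 0., 1., 0.]), np.array([1., 1., 0., 0.]), 0.40),
    (np.array([1., 0., 1., 1.]), np.array([1., 0., 1., 0.]), 0.75),
    (np.array([0., 1., 0., 1.]), np.array([1., 1., 0., 1.]), 0.60),
    (np.array([1., 1., 1., 1.]), np.array([1., 1., 1., 1.]), 1.00),
]

X = np.array([u * v for u, v, _ in pairs])   # X[k] = user_i * user_j, element-wise
y = np.array([s for _, _, s in pairs])

w, *_ = np.linalg.lstsq(X, y, rcond=None)    # ordinary least squares for the weights
print("learned weights:", w)
print("similarity:", weighted_similarity(w, pairs[0][0], pairs[0][1]))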
HTH

Methods of comparing prices

I will create a list of products that I wish to buy. Let's say they are all given a unique reference code. I have a list of suppliers I can buy from and for convenience each supplier uses the same reference code for each product.
Some suppliers charge shipping. Others only charge shipping if you spend less than a certain amount. Some suppliers discount certain products if you buy them more than once, but there may be restrictions (such as buy 1 get 1 free).
It is extremely easy to take the list of products I want to buy and tally up the total it would cost to buy all of them from each supplier. What I want to do though is create a script to work out whether it would be better to split the order.
For example:
Retailer A charges:
Product A - £5
Product B - £10
Product C - £10
Product D - £10
Shipping - £5
Retailer B charges:
Product A - £5
Product B - £12
Product C - £12
Product D - £30
Shipping - £5 - free if spending £20 or more
In this case, if I wanted to buy Product C only, the cheapest would be from retailer A.
If I wanted to buy:
1x Product A
2x Product B
1x Product D
The cheapest would be to buy products A and B from retailer B (because of the free delivery) and to then split the order and purchase product D from retailer A (as its price is significantly lower even with delivery included).
So in my head it's not a complex task and I can work it out very easily on paper. The question is, how I would translate this into code. I'm not looking for the code to do it - just some guidance on the theory of how to implement it.
If we restrict the problem to simply choosing which vendor to buy each product from, and you get a vendor-dependent reduction in shipping cost if you spend a vendor-dependent amount, then you can formulate your problem as an integer linear program (IP or ILP), which is a good strategy for problems suspected to be NP-hard because there has been a lot of research and software packages developed that try to solve ILP fast in practice. You can read about linear programming and ILP online. An ILP problem instance has variables, linear constraints on the variables, and a linear objective you want to minimize or maximize. Here's the ILP set up for your problem:
For each product that a vendor sells, you have one vendor-product variable that tells how many of the product you will purchase from the vendor. For each of these variables you have a constraint that the variable must be >= 0. For each product you wish to buy, you have a constraint that the sum of all the vendor-product variables for that product must equal the total number of the product that you wish to buy.
Then for each vendor that offers a shipping discount, you have a shipping discount variable which will be either 0 if you don't get the discount, or 1 if you do. For each one of these shipping discount variables, you have constraints that the variable must be >=0 and <= 1; you also have a constraint that says when you multiply each vendor-product variable for the vendor by the vendor's price for that product, and add it all up for the vendor (so you get the total amount you are spending at the vendor), this amount is >= the vendor's shipping discount variable multiplied by the vendor's minimum amount you need to spend to get the discount.
You also have for each vendor a vendor variable which is 1 if you use the vendor, and 0 if you don't. For each of these vendor variables A, you have constraints 1 >= A >= 0, and for each vendor-product variable B for the vendor, you have a constraint A >= B/N, where N is the total number of items you want to buy.
Finally, the objective you want to minimize is built by multiplying each vendor-product variable by the vendor's price for that product and adding it all up (call this part of the objective X), then multiplying each vendor's shipping discount variable by the shipping cost reduction you get if you qualify for the discount and adding it all up (call this part of the objective Y), and multiplying each vendor variable by the vendor's undiscounted shipping cost and adding it all up (call this part of the objective Z). Your objective is then simply to minimize X - Y + Z. This is all you need to define the ILP; you can then feed it into an ILP solver and hopefully get a solution quickly.
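Here is a sketch of that formulation in Python using the PuLP modelling package (an assumption; any MILP library would do), with the prices and order from the question. The variables mirror the description above: integer vendor-product quantities, a binary discount variable per vendor that offers one, and a binary "vendor used" variable:

from pulp import LpProblem, LpMinimize, LpVariable, LpInteger, LpBinary, lpSum, value

demand = {"A": 1, "B": 2, "D": 1}
prices = {
    "RetailerA": {"A": 5, "B": 10, "C": 10, "D": 10},
    "RetailerB": {"A": 5, "B": 12, "C": 12, "D": 30},
}
shipping = {"RetailerA": 5, "RetailerB": 5}
discount_threshold = {"RetailerB": 20}        # free shipping at this spend
N = sum(demand.values())

prob = LpProblem("order_split", LpMinimize)

buy = {(v, p): LpVariable(f"buy_{v}_{p}", lowBound=0, cat=LpInteger)
       for v in prices for p in demand if p in prices[v]}
disc = {v: LpVariable(f"disc_{v}", cat=LpBinary) for v in discount_threshold}
use = {v: LpVariable(f"use_{v}", cat=LpBinary) for v in prices}

# Demand: buy exactly the quantity wanted of each product, across vendors.
for p in demand:
    prob += lpSum(buy[v, p] for v in prices if (v, p) in buy) == demand[p]

# Shipping discount only if the spend at that vendor reaches the threshold.
for v, threshold in discount_threshold.items():
    spend = lpSum(buy[v, p] * prices[v][p] for p in demand if (v, p) in buy)
    prob += spend >= threshold * disc[v]

# A vendor counts as "used" (and charges shipping) if anything is bought there.
for (v, p), var in buy.items():
    prob += N * use[v] >= var

item_cost = lpSum(buy[v, p] * prices[v][p] for (v, p) in buy)
ship_cost = lpSum(use[v] * shipping[v] for v in prices)
ship_saving = lpSum(disc[v] * shipping[v] for v in discount_threshold)
prob += item_cost + ship_cost - ship_saving     # objective: total cost

prob.solve()
for (v, p), var in sorted(buy.items()):
    if var.value():
        print(f"buy {int(var.value())} x {p} from {v}")
print("total cost:", value(prob.objective))

Running it prints the quantities to order from each retailer and the minimum total cost, shipping and any earned discounts included.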
Mixed Integer Linear Programming is ok for your problem.
You can use a free solver such as Coin Clp. If you want to know how commercial MILP solvers perform, you can find some benchmarks here: http://plato.asu.edu/bench.html.
If you want to have a rough idea of the time required to solve your problem, you can run it on the NEOS Server: http://www.neos-server.org/neos/.
When you have a lot of 0-1 variables, you can also consider using Constraint Programming, which often suits heavy combinatorial problems better.
Both MILP and CP solvers use branch-and-bound techniques, which are faster than naive enumeration.
Cheers

Algorithm to calculate a page importance based on its views / comments

I need an algorithm that allows me to determine an appropriate <priority> field for my website's sitemap based on the page's views and comments count.
For those of you unfamiliar with sitemaps, the priority field is used to signal the importance of a page relative to the others on the same website. It must be a decimal number between 0 and 1.
The algorithm will accept two parameters, viewCount and commentCount, and will return the priority value. For example:
GetPriority(100000, 100000); // Damn, a lot of views/comments! The returned value will be very close to 1, for example 0.995
GetPriority(3, 2); // Ok not many users are interested in this page, so for example it will return 0.082
You mentioned doing this in an SQL query, so I'll give samples in that.
If you have a table/view Pages, something like this
Pages
-----
page_id:int
views:int - indexed
comments:int - indexed
Then you can order them by writing
SELECT * FROM Pages
ORDER BY
0.3*LOG10(10+views)/LOG10(10+(SELECT MAX(views) FROM Pages)) +
0.7*LOG10(10+comments)/LOG10(10+(SELECT MAX(comments) FROM Pages))
I've deliberately chosen unequal weighting between views and comments. A problem that can arise with keeping an equal weighting with views/comments is that the ranking becomes a self-fulfilling prophecy - a page is returned at the top of the list, so it's visited more often, and thus gets more points, so it's shown at the top of the list, and it's visited more often, and it gets more points.... Putting more weight on the comments reflects that these take real effort and show real interest.
The above formula will give you ranking based on all-time statistics. So an article that amassed the same number of views/comments in the last week as another article amassed in the last year will be given the same priority. It may make sense to repeat the formula, each time specifying a range of dates, and favoring pages with higher activity, e.g.
0.3*(score for views/comments today) - live data
0.3*(score for views/comments in the last week)
0.25*(score for views/comments in the last month)
0.15*(score for all views/comments, all time)
This will ensure that "hot" pages are given higher priority than similarly scored pages that haven't seen much action lately. All values apart from today's scores can be persisted in tables by scheduled stored procedures so that the database isn't having to aggregate many many comments/view stats. Only today's stats are computed "live". Taking it one step further, the ranking formula itself can be computed and stored for historical data by a stored procedure run daily.
EDIT: To get a strict range from 0.1 to 1.0, you would modify the formula like this. But I stress - this will only add overhead and is unnecessary - the absolute values of priority are not important - only their relative values to other urls. The search engine uses these to answer the question, is URL A more important/relevant than URL B? It does this by comparing their priorities - which one is greatest - not their absolute values.
// the original formula (now in pseudo code)
// un(x) is the unnormalized priority of page x, where x is some page id
un(x) = 0.3*log(views(x)+10)/log(10+maxViews()) +
        0.7*log(comments(x)+10)/log(10+maxComments())
The maximum will be 1.0; the minimum, which we define as un(0) (i.e. the formula above with views(x) and comments(x) both 0), will start at 1.0 and move downwards as more views/comments are made.
To get a normalized formula from 0.1 to 1.0, you then compute n(x), the normalized priority for page x:

                  (1.0-un(x)) * (un(0)-0.1)
   n(x) = un(x) - -------------------------    when un(0) != 1.0
                          1.0-un(0)

        = 0.1                                  otherwise.
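For reference, here is the same pseudocode transcribed into a small Python function (assuming maxViews and maxComments are computed site-wide beforehand):

import math

def unnormalized(views, comments, max_views, max_comments):
    return (0.3 * math.log10(views + 10) / math.log10(10 + max_views) +
            0.7 * math.log10(comments + 10) / math.log10(10 + max_comments))

def priority(views, comments, max_views, max_comments):
    un_x = unnormalized(views, comments, max_views, max_comments)
    un_0 = unnormalized(0, 0, max_views, max_comments)   # the site-wide minimum
    if un_0 == 1.0:                                      # no views/comments anywhere yet
        return 0.1
    # Linearly rescale [un_0, 1.0] onto [0.1, 1.0]
    return un_x - (1.0 - un_x) * (un_0 - 0.1) / (1.0 - un_0)

print(priority(100000, 100000, 100000, 100000))  # close to 1.0
print(priority(3, 2, 100000, 100000))            # near the bottom of the range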
Priority = W1 * views / maxViewsOfAllArticles + W2 * comments / maxCommentsOfAllArticles
with W1+W2=1
Although IMHO, just use 0.5*log_10(10+views)/log_10(10+maxViews) + 0.5*log_10(10+comments)/log_10(10+maxComments)
What you're looking for here is not an algorithm, but a formula.
Unfortunately, you haven't really specified the details of what you want, so there's no way we can provide the formula to you.
Instead, let's try to walk through the problem together.
You've got two incoming parameters, the viewCount and the commentCount. You want to return a single number, Priority. So far, so good.
You say that Priority should range between 0 and 1, but this isn't really important. If we were to come up with a formula we liked that resulted in values between 0 and N, we could just divide the results by N, so this constraint isn't really relevant.
Now, the first thing we need to decide is the relative weight of Comments vs Views.
If page A has 100 comments and 10 views, and page B has 10 comments and 100 views, which should have a higher priority? Or, should it be the same priority? You need to decide what's right for your definition of Priority.
If you decide, for example, that comments are 5 times more valuable than views, then we can begin with a formula like
Priority = 5 * Comments + Views
Obviously, this can be generalized to
Priority = A * Comments + B * Views
Where A and B are relative weights.
But, sometimes we want our weights to be exponential instead of linear, like
Priority = Comments ^ A + Views ^ B
which will give a very different curve than the earlier formula.
Similarly,
Priority = Comments ^ A * Views ^ B
will give higher value to a page with 20 comments and 20 views than one with 1 comment and 40 views, if the weights are equal.
So, to summarize:
You really ought to make a spreadsheet with sample values for Views and Comments, and then play around with various formulas until you get one that has the distribution that you are hoping for.
We can't do it for you, because we don't know how you want to value things.
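If you prefer code to a spreadsheet, a few lines of Python will do the same job: tabulate a handful of candidate formulas over sample (comments, views) pairs and eyeball which distribution you prefer (the weights and exponents below are illustrative only):

A, B = 5, 1            # linear weights: comments worth 5x views
a, b = 1.2, 1.0        # exponents for the power-based variants

samples = [(1, 40), (20, 20), (100, 10), (3, 2)]
for comments, views in samples:
    linear  = A * comments + B * views
    power   = comments ** a + views ** b
    product = (comments ** a) * (views ** b)
    print(f"c={comments:4d} v={views:4d}  "
          f"linear={linear:6d}  power={power:8.1f}  product={product:9.1f}")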
I know it has been a while since this was asked, but I encountered a similar problem and had a different solution.
When you want to have a way to rank something, and there are multiple factors that you're using to perform that ranking, you're doing something called multi-criteria decision analysis. (MCDA). See: http://en.wikipedia.org/wiki/Multi-criteria_decision_analysis
There are several ways to handle this. In your case, your criteria have different "units". One is in units of comments, the other is in units of views. Furthermore, you may want to give different weight to these criteria based on whatever business rules you come up with.
In that case, the best solution is something called a weighted product model. See: http://en.wikipedia.org/wiki/Weighted_product_model
The gist is that you take each of your criteria and turn it into a percentage (as was previously suggested), then you take that percentage and raise it to the power of X, where X is a number between 0 and 1. This number represents your weight. Your total weights should add up to one.
Lastly, you multiply each of the results together to come up with a rank. If the rank is greater than 1, then the numerator page has a higher rank than the denominator page.
Each page would be compared against every other page by doing something like:
p1C = page 1 comments
p1V = page 1 view
p2C = page 2 comments
p2V = page 2 views
wC = comment weight
wV = view weight
rank = (p1C/p2C)^(wC) * (p1V/p2V)^(wV)
The end result is a sorted list of pages according to their rank.
I've implemented this in C# by performing a sort on a collection of objects implementing IComparable.
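For comparison, here is a minimal Python version of that pairwise comparison (the poster used C# and IComparable; functools.cmp_to_key plays the same role here). Note that zero comment or view counts would cause a division by zero, which you would need to guard against on real data:

from functools import cmp_to_key

wC, wV = 0.6, 0.4          # comment/view weights, summing to 1

pages = [
    {"id": 1, "comments": 10, "views": 200},
    {"id": 2, "comments": 50, "views": 100},
    {"id": 3, "comments": 5,  "views": 800},
]

def compare(p1, p2):
    rank = (p1["comments"] / p2["comments"]) ** wC * \
           (p1["views"] / p2["views"]) ** wV
    return -1 if rank > 1 else (1 if rank < 1 else 0)   # rank > 1: p1 first

for page in sorted(pages, key=cmp_to_key(compare)):
    print(page["id"], page["comments"], page["views"])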
What several posters have essentially advocated without conceptual clarification is that you use linear regression to determine a weighting function of webpage view and comment counts to establish priority.
This technique is pretty easy to implement for your problem, and the basic concept is described well in this Wikipedia article on linear regression models.
A quick summary of how to apply it to your problem is:
Determine the parameters of the line which best fits the view and comment count data for all your site's webpages, i.e., use linear regression.
Use the line parameters to derive your priority function for the view/count parameters.
Code examples for basic linear regression should not be hard to track down if you don't want to implement it from scratch from basic math formulas (use the web, Numerical Recipes, etc.). Also, any general math software package like Matlab, R, etc., comes with linear regression functions.
The most naive approach would be the following:
Let v[i] be the views of page i and c[i] the number of comments for page i, then define the relative view weight for page i to be
r_v(i) = v[i]/(sum_j v[j])
where sum_j v[j] is the total of the v[.] over all pages. Similarly define the relative comment weight for page i to be
r_c(i) = c[i]/(sum_j c[j]).
Now you want some constant parameter p: 0 < p < 1 which indicates the importance of views over comments: p = 0 means only comments are significant, p = 1 means only views are significant, and p = 0.5 gives equal weight.
Then set the priority to be
p*r_v(i) + (1-p)*r_c(i)
This might be over-simplistic but it's probably the best starting point.
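In Python, that naive formula is only a few lines (p is the views-versus-comments trade-off parameter described above):

def priorities(views, comments, p=0.5):
    total_v, total_c = sum(views), sum(comments)
    r_v = [v / total_v for v in views]          # relative view weight per page
    r_c = [c / total_c for c in comments]       # relative comment weight per page
    return [p * rv + (1 - p) * rc for rv, rc in zip(r_v, r_c)]

print(priorities(views=[100000, 3, 500], comments=[100000, 2, 50]))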
