Genetic Algorithm on modified knapsack problem - genetic-algorithm

Let’s say, you are going to spend a month in the wilderness. The only thing you are carrying is a
backpack that can hold a maximum weight of 40 kg. Now you have different survival items, each
having its own “Survival Points” (which are given for each item in the table). Some of the items are so
essential that if you do not take them, you incur some additional penalty.
Here is the table giving details about each item.
Item            Weight   Survival Points   Penalty if not taken
Sleeping Bag    30       20                0
Rope            10       10                0
Bottle          5        20                0
Torch+Battery   15       25                -20
Glucose         5        30                0
Pocket Knife    10       15                -10
Umbrella        20       10                0
Formulate this as a genetic algorithm problem where your objective is to maximize the survival points.
Write how you would represent the chromosomes, fitness function, crossover, mutation, etc.
I am not sure what the fitness function should be. A simple fitness function I thought of is to add up the survival points of the items we take and subtract the penalties of the essential items we leave behind. But with this definition, the overall fitness of a particular chromosome can be negative.
Please tell me how I should proceed and what an appropriate fitness function would be in this case.
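A minimal sketch of one possible encoding and fitness function (one reasonable choice, not the only one): represent a chromosome as a bit string of length 7, one bit per item, and handle the 40 kg limit with a weight-overflow penalty. The overweight penalty constant and the shift that keeps fitness non-negative are assumptions, useful if your selection scheme (e.g. roulette wheel) requires non-negative fitness.

import random

ITEMS = [
    # (name, weight, survival points, penalty if not taken)
    ("Sleeping Bag", 30, 20,   0),
    ("Rope",         10, 10,   0),
    ("Bottle",        5, 20,   0),
    ("Torch+Battery", 15, 25, -20),
    ("Glucose",        5, 30,   0),
    ("Pocket Knife",  10, 15, -10),
    ("Umbrella",      20, 10,   0),
]
MAX_WEIGHT = 40
OVERWEIGHT_PENALTY = 50   # assumed penalty per kg over the limit

def fitness(chromosome):
    """chromosome: list of 0/1, one gene per item (1 = take it)."""
    weight = sum(w for gene, (_, w, _, _) in zip(chromosome, ITEMS) if gene)
    points = sum(s for gene, (_, _, s, _) in zip(chromosome, ITEMS) if gene)
    missed = sum(p for gene, (_, _, _, p) in zip(chromosome, ITEMS) if not gene)
    score = points + missed                      # missed penalties are negative
    if weight > MAX_WEIGHT:                      # infeasible: punish overweight packs
        score -= OVERWEIGHT_PENALTY * (weight - MAX_WEIGHT)
    # Optional shift (assumed constant) so fitness is never negative,
    # handy for roulette-wheel selection.
    return max(score + 30, 0)

population = [[random.randint(0, 1) for _ in ITEMS] for _ in range(20)]
print(max(fitness(c) for c in population))

Crossover and mutation can then be the standard bit-string operators (single-point crossover, per-gene bit flip).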

Related

algo problem - fewest choice that maximizes score

The problem statement is as follows:
A science tournament is taking place in which each team can design solar vehicles, with the following scoring system:
you can submit 0 < D < 20 different designs to compete
each design is given a different penalty score (based on weight, dimensions, production cost, materials used, etc.)
for every design submitted, its penalty score is applied immediately, and the total penalty serves as the team's starting score; the more designs you submit, the higher the total penalty
each design runs on all available terrains, once each
each design is given a score of 0-100 on each terrain, based on how many km it travels in an hour; there will be 0 < T < 30 terrains
we get only one score per terrain; if multiple designs run the same terrain, the highest score is awarded
D #   penalty score   T1   T2   T3
D1    1               10   10   7
D2    2               8    8    12
D3    3               15   16   8
if we submit all 3 designs, the total penalty score for our team is 6, making our initial score -6
and our terrain scores:
T1 -> D3 15 points
T2 -> D3 16 points
T3 -> D2 12 points
penalty -> -6
-------------------------+
total -> 37 points
D1, even though its penalty score is the lowest, is actually useless and we don't need to submit it in the first place; we could score 38 points by submitting only D2 and D3. We need to find the highest score we can get given D designs and T terrains, and we can pick and choose which design(s) to submit to the tournament.
Brute force over every subset of designs is exponential in D (there are 2^D subsets).
is there any better way to solve this?
Thanks
This problem is NP-hard.
To show that, let's reduce the set cover problem to this one.
Let's assign one terrain per element, and one design per set. Every design has a penalty of 1. A design scores 1 on a terrain that is not in its set, and 2 * number_of_designs on one that is. It is straightforward to prove that the optimal tournament submission is the smallest set of designs corresponding to a set cover in the original instance. So if we can solve your problem efficiently, then we can find the minimal set cover.
I would suggest attempting some kind of branch and bound algorithm to solve this. Either exactly or heuristically.
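For small D, here is a hedged sketch of the exhaustive baseline that branch and bound would improve on: enumerate every subset of designs, score it, and keep the best. A branch-and-bound version would additionally prune partial subsets whose optimistic bound cannot beat the best found so far; the sketch below only shows the exhaustive search, using the example data from the question.

from itertools import combinations

# Each design: (penalty, [score on T1, T2, T3]) -- data from the example above
designs = {
    "D1": (1, [10, 10, 7]),
    "D2": (2, [8, 8, 12]),
    "D3": (3, [15, 16, 8]),
}

def team_score(chosen):
    """Score of a submission: best score per terrain minus total penalty."""
    if not chosen:
        return 0
    penalty = sum(designs[d][0] for d in chosen)
    terrains = len(next(iter(designs.values()))[1])
    terrain_total = sum(max(designs[d][1][t] for d in chosen) for t in range(terrains))
    return terrain_total - penalty

names = list(designs)
best = max(
    (subset for r in range(1, len(names) + 1) for subset in combinations(names, r)),
    key=team_score,
)
print(best, team_score(best))   # ('D2', 'D3') 38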

Probability Events - Total Probability of An Event Happening

I can't seem to work out the below probability question. Wonder if anyone can help? Thanks in advance!
"10 people have made a booking for a shuttle transfer and the departure time is 10am. If
all customers arrive early, the shuttle will depart earlier. The shuttle will wait until
10:10am if any customers are late. If the probability of customers arriving 10 min earlier
is 0.1, 5 min earlier is 0.2, right at 10am is 0.5, late for 5 min is 0.1, late for 10 min is
0.05, and late for over 10min is 0.05. What is the probability for the shuttle to depart on
or before 10am?"
This can be expressed as a binomial distribution if you let p = 0.8 be the probability that an individual arrives on time (0.1 + 0.2 + 0.5), and let n = 10 denote the number of trials.
The probability mass function (PMF) of a binomial distribution is
P(X = k) = C(n, k) · p^k · (1 - p)^(n - k)
that is, the probability of observing k successes in n trials, each with success probability p.
Now, the probability that the shuttle departs on (or before) time is equivalent to the probability that all individuals arrive at least on time, that is P(X = k) with k = 10 successes.
Using the above PMF we obtain
P(X = 10) = C(10, 10) · 0.8^10 · 0.2^0 = 0.8^10 ≈ 0.107
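To sanity-check the arithmetic, a quick sketch using only the standard library:

from math import comb

n, p = 10, 0.1 + 0.2 + 0.5          # 10 passengers, P(arrive on or before 10am) = 0.8

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The shuttle leaves on or before 10am only if all 10 arrive on time.
print(binom_pmf(10, n, p))          # ~0.107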

Given a list of 2D points and a square grid size, return the coordinate closest to the most points

Here's a summarized problem statement from an interview I had:
There is an n x n grid representing a city, along with a list of k
3-tuples (x, y, w), where (x, y) is the coordinate of an event,
and w is the "worth" of the event. You're also given a radius
r, which represents how far you can see. You derive happiness h from seeing an event, and h=w/d, where d is (1 + Euclidean distance to the event) (to account for 0 distance). If d is greater than r, then the happiness is 0. Output a coordinate (x,y) that has the highest cumulative happiness.
I didn't really know how to approach this problem other than brute forcing through each possible coordinate and calculating the happiness at each point, recording the max. I also thought about calculating the center of mass of the points and finding the closest integer coordinates to the center of mass, but that doesn't properly take into account the "worth" of the event.
What's the best way to approach this problem?
(I can't see an obvious best algorithm or data structure for this; it could be one of those questions where they wanted to hear your thought process more than your solution.)
Of the two obvious approaches:
Iterating over all locations and measuring the distance to all events to calculate the location's worth
Iterating over all events and adding to the worth of the locations in the circle around them
the latter seems to be the most efficient one. You're never looking at worthless locations, and to distribute the worth, you only need to calculate one octant of the circle and then mirror it for the rest of the circle.
You obviously need the memory space to store a rectangular grid of the locations' worth, so that's a consideration. And if you don't know the city size beforehand, you'd have to iterate over the input once just to choose the grid size. (In contrast, the first method would require almost no memory space).
Time complexity-wise, you'd iterate over k events, and for each of these you'd have to calculate the worth of a number of locations proportional to r². You can keep a running maximum while iterating over the events, so finding the maximum value doesn't add to the time complexity. (In the first method, you'd obviously have to calculate all the same w/(1 + distance) values, without the advantage of mirroring one octant of the circle, plus at least the distances to all the additional worthless locations.)
If the number of events and the affected regions around them are small compared to the city size, the advantage of the second method is obvious. If there are a large number of events and/or r is large, the difference may not be significant.
There may be some mathematical tricks to decide which events to check first, or which to ignore, or when to stop, but you'd have to know some more details for that, e.g. whether two events can happen at the same location. There could e.g. be an advantage in sorting the events by worth and looking at the events with the most worth first, because at some point it may become obvious that events outside of a "hot spot" around the current maximum can be ignored. But much would depend on the specifics of the data.
UPDATE
When distributing the worth of an event over the locations around it, you obviously don't have to calculate the distances more than once; e.g. if r = 3 you'd make this 7×7 grid with 1/(1 + distance) weights:
0 0 0 0.250 0 0 0
0 0.261 0.309 0.333 0.309 0.261 0
0 0.309 0.414 0.500 0.414 0.309 0
0.250 0.333 0.500 1.000 0.500 0.333 0.250
0 0.309 0.414 0.500 0.414 0.309 0
0 0.261 0.309 0.333 0.309 0.261 0
0 0 0 0.250 0 0 0
Which contains only eight different values. Then you'd use this as a template to overlay on top of the grid at the location of an event, and multiply the event's worth with the weights and add them to each location's worth.
UPDATE
I considered the possibility that only locations with an event could be the location with the highest worth, and without the limit r that would be true. That would make the problem quite different. However, it's easy to create a counter-example; consider e.g. these events:
- - 60 - -
- - - - -
60 - - - 60
- - - - -
- - 60 - -
With a limit r greater than 4, they would create this worth in the locations around them:
61.92 73.28 103.3 73.28 61.92
73.28 78.54 82.08 78.54 73.28
103.3 82.08 80.00 82.08 103.3
73.28 78.54 82.08 78.54 73.28
61.92 73.28 103.3 73.28 61.92
And the locations with the highest worth 103.3 are the locations of the events. However, if we set the limit r = 2, we get:
40 30 60 30 40
30 49.7 30 49.7 30
60 30 80 30 60
30 49.7 30 49.7 30
40 30 60 30 40
And the location in the middle, which doesn't have an event, is now the location of maximum worth 80.
This means that locations without events, at least those within the convex hull around a cluster of events, have to be considered. Of course, if two clusters of events are found to be more than 2 × r away from each other, they can be treated as separate zones. In that case, you wouldn't have to create a grid for the whole city, but separate smaller grids around every cluster.
So the overall approach would be:
Create the square weight grid of size (2r + 1) × (2r + 1).
Separate the events into clusters with a distance of more than 2 × r between them.
For each cluster of events, create the smallest rectangular grid that fits around the events.
For each event, use the weight grid to distribute worth over the rectangular grid.
While adding worth to locations, keep track of the maximum worth.
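A hedged sketch of the weight-template idea under the assumptions above (integer grid coordinates, h = w / (1 + distance), cutoff at distance r). Cluster separation and the octant-mirroring optimization are omitted for clarity; the template is simply overlaid on a city-sized grid.

import math

def weight_template(r):
    """(2r+1) x (2r+1) grid of 1/(1+distance) weights, 0 beyond radius r."""
    size = 2 * r + 1
    tpl = [[0.0] * size for _ in range(size)]
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            dist = math.hypot(dx, dy)
            if dist <= r:
                tpl[dy + r][dx + r] = 1.0 / (1.0 + dist)
    return tpl

def best_location(n, events, r):
    """events: list of (x, y, w). Returns ((x, y), worth) with maximum total happiness."""
    tpl = weight_template(r)
    worth = [[0.0] * n for _ in range(n)]
    for ex, ey, w in events:
        # Overlay the template centred on the event and accumulate worth.
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                x, y = ex + dx, ey + dy
                if 0 <= x < n and 0 <= y < n:
                    worth[y][x] += w * tpl[dy + r][dx + r]
    best = max(((x, y) for y in range(n) for x in range(n)),
               key=lambda p: worth[p[1]][p[0]])
    return best, worth[best[1]][best[0]]

# Small test: the 5x5 counter-example above with r = 2 peaks at the centre with worth 80.
events = [(2, 0, 60), (0, 2, 60), (4, 2, 60), (2, 4, 60)]
print(best_location(5, events, 2))   # ((2, 2), 80.0)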

What is the importance of average in performance metrics?

As part of the performance tuning and load tests we usually do, I am led to believe that we need to look at 90th percentiles. As I understand it, 90 times out of a hundred, people got a response that is equal to or better than the 90th percentile number. However, my current clients always look at the average. What is the impact of only looking at the average? Most of the time I see that, between two tests, if the average is lower in test A, then the 90th percentile is also lower in test A.
So should we match the SLA on average or on 90th percentile?
I agree that this is not a pure programming question. But in my humble opinion, program performance and statistics are closely related anyway. That's why I think this question deserves an answer.
The two are different in nature. On the one hand we have the average - the sum of all observations divided by the number of observations. On the other we have the median, or 50th percentile - half of the observations are above it, half are below.
There's a very visible difference in the two if your observations do not match the bell curve: e.g. if you have positive outliers but no negative outliers.
Let's do a few number examples:
observations 2 4 6 8 - average and median are both 5
observations 1 1 10 - average is 4, median is 1.
Your 50 percentile could be argued on any number between 1 and 10 here: two observations are below, one is above for any of these numbers.
observations 1 4 1000 - average is 335, median is 4, 50 percentile is also 4.
As you can see, the distribution of the numbers matters a lot.
Only if you have a symmetrical distribution (like a Gaussian bell curve) does the average equal the 50th percentile.
But you asked about the 90th percentile.
Essentially, nothing changes - the distribution, the number of outliers, and the most frequently observed values all affect your percentile.
I suggest picking up a good book on statistics if you need to know more.
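A quick illustration of why the two metrics can diverge on response times (the numbers are assumed, just to make the point):

import statistics

# 100 response times in ms: most are fast, a handful of slow outliers.
latencies = [100] * 90 + [2000] * 10

avg = statistics.mean(latencies)
p90 = sorted(latencies)[int(0.9 * len(latencies)) - 1]   # simple 90th-percentile estimate

print(avg, p90)   # 290 vs 100 -- the average is dominated by the outliers

An SLA on the average can look fine while one user in ten waits twenty times longer than the rest, which is why percentiles are usually the better target.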

Optimal sequence of non-overlapping purchases

I think this is a scheduling problem, but I'm not even sure on that much! What I want is to find the optimal sequence of non-overlapping purchase decisions, when I have full knowledge of their value and what opportunities are coming up in the future.
Imagine a wholesaler who sells various goods that I want to buy for my own shop. At any time they may have multiple special offers running; I will sell at full price, so their discount is my profit.
I want to maximize profit, but the catch is that I can only buy one thing at a time, and no such thing as credit, and worse, there is a delivery delay. The good news is I will sell the items as soon as they are delivered, and I can then go spend my money again. So, one path through all the choices might be: I buy 100kg apples on Monday, they are delivered on Tuesday. I then buy 20 nun costumes delivered, appropriately enough, on Sunday. I skip a couple of days, as I know on Wednesday they'll have a Ferrari at a heavy discount. So I buy one of those, it is delivered the following Tuesday. And so on.
You can consider compounding profits or not. The algorithm comes down to a decision at each stage between choosing one of today's special offers, or waiting a day because something better is coming tomorrow.
Let's abstract that a bit. Buy and delivery become days-since-epoch. Profit is written as sell-price divided by buy-price. I.e. 1.00 means break-even, 1.10 means a 10% profit, 2.0 means I doubled my money.
buy  delivery  profit  item
1 2 1.10 Apples
1 3 1.15 Viagra
2 3 1.15 Notebooks
3 7 1.30 Nun costumes
4 7 1.28 Priest costumes
6 7 1.09 Oranges
6 8 1.11 Pears
7 9 1.16 Yellow shoes
8 10 1.15 Red shoes
10 15 1.50 Red Ferrari
11 15 1.40 Yellow Ferrari
13 16 1.25 Organic grapes
14 19 1.30 Organic wine
NOTES: opportunities exist only on the buy day (e.g. the organic grapes get made into wine if no-one buys them!), and I get to sell on the same day as delivery, but cannot buy my next item until the following day. So I cannot sell my nun costumes at t=7 and immediately buy yellow shoes at t=7.
I was hoping there exists a known best algorithm, and that there is already an R module for it, but algorithms or academic literature would also be good, as would anything in any other language. Speed matters, but mainly when the data gets big, so I'd like to know if it is O(n²), or whatever.
By the way, does the best algorithm change if there is a maximum possible delivery delay? E.g. if delivery - buy <= 7
Here is the above data as CSV:
buy,delivery,profit,item
1,2,1.10,Apples
1,3,1.15,Viagra
2,3,1.15,Notebooks
3,7,1.30,Nun costumes
4,7,1.28,Priest costumes
6,7,1.09,Oranges
6,8,1.11,Pears
7,9,1.16,Yellow shoes
8,10,1.15,Red shoes
10,15,1.50,Red Ferrari
11,15,1.40,Yellow Ferrari
13,16,1.25,Organic grapes
14,19,1.30,Organic wine
Or as JSON:
{"headers":["buy","delivery","profit","item"],"data":[[1,2,1.1,"Apples"],[1,3,1.15,"Viagra"],[2,3,1.15,"Notebooks"],[3,7,1.3,"Nun costumes"],[4,7,1.28,"Priest costumes"],[6,7,1.09,"Oranges"],[6,8,1.11,"Pears"],[7,9,1.16,"Yellow shoes"],[8,10,1.15,"Red shoes"],[10,15,1.5,"Red Ferrari"],[11,15,1.4,"Yellow Ferrari"],[13,16,1.25,"Organic grapes"],[14,19,1.3,"Organic wine"]]}
Or as an R data frame:
structure(list(buy = c(1L, 1L, 2L, 3L, 4L, 6L, 6L, 7L, 8L, 10L,
11L, 13L, 14L), delivery = c(2L, 3L, 3L, 7L, 7L, 7L, 8L, 9L,
10L, 15L, 15L, 16L, 19L), profit = c(1.1, 1.15, 1.15, 1.3, 1.28,
1.09, 1.11, 1.16, 1.15, 1.5, 1.4, 1.25, 1.3), item = c("Apples",
"Viagra", "Notebooks", "Nun costumes", "Priest costumes", "Oranges",
"Pears", "Yellow shoes", "Red shoes", "Red Ferrari", "Yellow Ferrari",
"Organic grapes", "Organic wine")), .Names = c("buy", "delivery",
"profit", "item"), row.names = c(NA, -13L), class = "data.frame")
LINKS
Are there any R Packages for Graphs (shortest path, etc.)? (igraph offers a shortest.paths function and in addition to the C library, has an R package and a python interface)
The simplest way to think of this problem is as analogous to a shortest-path problem (although treating it as a maximum flow problem probably is technically better). The day numbers, 1 ... 19, can be used as node names; each node j has a link to node j+1 with weight 1, and each product (b,d,g,p) in the list adds a link from day b to day d+1 with weight g. As we progress through the nodes when path-finding, we keep track of the best multiplied values seen so far at each node.
The Python code shown below runs in time O(V+E) where V is the number of vertices (or days), and E is the number of edges. In this implementation, E = V + number of products being sold. Added note: The loop for i, t in enumerate(graf): treats each vertex once. In that loop, for e in edges: treats edges from the current vertex once each. Thus, no edge is treated more than once, so performance is O(V+E).
Edited note 2: krjampani claimed that O(V+E) is slower than O(n log n), where n is the number of products. However, the two orders are not comparable unless we make assumptions about the number of days considered. Note that if delivery delays are bounded and product dates overlap, then number of days is O(n) whence O(V+E) = O(n), which is faster than O(n log n).
However, under a given set of assumptions the run time orders of my method and krjampani's can be the same: For large numbers of days, change my method to create graph nodes only for days in the sorted union of x[0] and x[1] values, and using links to day[i-1] and day[i+1] instead of to i-1 and i+1. For small numbers of days, change krjampani's method to use an O(n) counting sort.
The program's output looks like the following:
16 : 2.36992 [11, 15, 1.4, 'Yellow Ferrari']
11 : 1.6928 [8, 10, 1.15, 'Red shoes']
8 : 1.472 [4, 7, 1.28, 'Priest costumes']
4 : 1.15 [1, 3, 1.15, 'Viagra']
which indicates that we arrived at day 16 with a compounded profit of 2.36992, after selling Yellow Ferraris on day 15; arrived at day 11 with profit 1.6928, after selling Red shoes; and so forth. Note that the dummy entry at the beginning of the products list, and the removal of quotes around the numbers, are the main differences vs the JSON data. The entry in list element graf[j] starts out as [1, j-1, 0, [[j+1,1,0]]], that is, it has the form [best-value-so-far, best-from-node#, best-from-product-key, edge-list]. Each edge-list is a list of lists of the form [next-node#, edge-weight, product-key]. Having product 0 be a dummy product simplifies initialization.
products = [[0,0,0,""],[1,2,1.10,"Apples"],[1,3,1.15,"Viagra"],[2,3,1.15,"Notebooks"],[3,7,1.30,"Nun costumes"],[4,7,1.28,"Priest costumes"],[6,7,1.09,"Oranges"],[6,8,1.11,"Pears"],[7,9,1.16,"Yellow shoes"],[8,10,1.15,"Red shoes"],[10,15,1.50,"Red Ferrari"],[11,15,1.40,"Yellow Ferrari"],[13,16,1.25,"Organic grapes"],[14,19,1.30,"Organic wine"]]

hiDay = max([x[1] for x in products])
# graf[j] = [best-value-so-far, best-from-node#, best-from-product-key, edge-list];
# the initial edge [j+1, 1, 0] is the "wait a day" edge with gain 1.
graf = [[1, i-1, 0, [[i+1, 1, 0]]] for i in range(2 + hiDay)]
for i, x in enumerate(products):
    b, d, g, p = x[:]
    graf[b][3] += [[d+1, g, i]]    # Add an edge for each product: buy day -> day after delivery
for i, t in enumerate(graf):
    if i > hiDay: break
    valu = t[0]                    # Best value of path to current node
    edges = t[3]                   # List of edges out of current node
    for e in edges:
        link, gain, prod = e[:]
        v = valu * gain
        if v > graf[link][0]:      # Relax: better compounded profit found for node `link`
            graf[link][0:3] = [v, i, prod]
day = hiDay
while day > 0:                     # Walk back along best-from links to report the purchases
    if graf[day][2] > 0:
        print(day, ":\t", graf[day][0], products[graf[day][2]])
    day = graf[day][1]
This problem maps naturally to the problem of finding the maximum weight independent intervals among a set of weighted intervals. Each item in your input set corresponds to an interval whose start and end points are the buy and delivery dates and the item's profit represents the weight of the interval. The maximum weight independent intervals problem is to find a set of disjoint intervals whose total weight is the maximum.
The problem can be solved in O(n log n) as follows. Sort the intervals by their end points. We then travel through each interval i in the sorted list and compute the optimal solution for the subproblem that consists of intervals from 1...i in the sorted list. The optimal solution of the problem for intervals 1...i is the maximum of:
1. The optimal solution of the problem for intervals `1...(i-1)` in the
sorted list or
2. Weight of interval `i` + the optimal solution of the problem for intervals
`1...j`, where j is the last interval in the sorted list whose end-point
is less than the start-point of `i`.
Note that this algorithm runs in O(n log n) and computes the value of the optimal solution for every prefix of the sorted list.
After we run this algorithm, we can travel through the sorted-list in reverse order and find the intervals present in the optimal solution based on the values computed for each prefix.
EDIT:
For this to work correctly the weights of the intervals should be the actual profits of the corresponding items (i.e. they should be sell_price - buy_price).
Update 2: Running time
Let V be the number of days (following jwpat7's notation).
If V is much smaller than O(n log n), we can use the counting sort to sort the intervals in O(n + V) time and use an array of size V to record the solutions to the subproblems. This approach results in a time complexity of O(V + n).
So the running time of the algorithm is min(O(V+n), O(n log n)).
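A hedged sketch of this O(n log n) approach in Python, using additive per-item profits as the EDIT above suggests (compounding is ignored; the multiplier minus one is used here as a stand-in for the per-item margin). The binary search finds the number of intervals whose delivery day is strictly before the current buy day, matching the rule that you cannot buy again until the day after delivery.

import bisect

# (buy_day, delivery_day, additive margin) -- margins derived from the multipliers above
items = [(1, 2, 0.10), (1, 3, 0.15), (2, 3, 0.15), (3, 7, 0.30), (4, 7, 0.28),
         (6, 7, 0.09), (6, 8, 0.11), (7, 9, 0.16), (8, 10, 0.15), (10, 15, 0.50),
         (11, 15, 0.40), (13, 16, 0.25), (14, 19, 0.30)]

items.sort(key=lambda it: it[1])            # sort by delivery day (interval end point)
ends = [it[1] for it in items]

n = len(items)
best = [0.0] * (n + 1)                      # best[i] = optimal profit using the first i intervals
for i, (buy, delivery, profit) in enumerate(items, start=1):
    # Number of intervals delivered strictly before this buy day
    j = bisect.bisect_left(ends, buy)
    best[i] = max(best[i - 1],              # skip this interval
                  best[j] + profit)         # take it on top of the best compatible prefix
print(best[n])                              # ~0.98 with these illustrative margins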
This is a dynamic programming problem. Making an overall optimal choice only requires making optimal choices at each step. You can make a table that describes the optimal choice at each step based on the previous state and the profit of taking various steps from that state. You can collapse a large set of possibilities into a smaller set by eliminating the possibilities that are clearly non-optimal as you go.
In your problem, the only state that affects choices is the delivery date. For example, on day one, you have three choices: You can buy apples, set your profit to 1.10, and set your delivery date to 2; buy viagra, set your profit to 1.15, and set your delivery date to 3; or buy nothing, set your profit to zero, and set your delivery date to 2. We can represent these alternatives like this:
(choices=[apples], delivery=2, profit=1.10) or
(choices=[viagra], delivery=3, profit=1.15) or
(choices=[wait], delivery=2, profit=0.00)
It isn't going to make any difference whether you buy viagra or buy nothing on the first day as far as making future decisions. Either way, the next day you can make a purchase is day two, so you can eliminate waiting as an alternative since the profit is lower. However, if you buy apples, that will affect future decisions differently than if you buy viagra or wait, so it is a different alternative you have to consider. That just leaves you with these alternatives at the end of day one.
(choices=[apples], delivery=2, profit=1.10) or
(choices=[viagra], delivery=3, profit=1.15)
For day two, you need to consider your alternatives based on what the alternatives were on day one. This produces three possibilities:
(choices=[apples,notebooks], delivery=3, profit=2.25) or
(choices=[apples,wait], delivery=3, profit=1.10) or
(choices=[viagra,wait], delivery=3, profit=1.15)
All three of these choices put you in the same state as far as future decisions are considered, since they all put the delivery date at 3, so you simply choose the one with maximum profit:
(choices=[apples,notebooks], delivery=3, profit=2.25)
Going on to day three, you have two alternatives
(choices=[apples,notebooks,wait], delivery=4, profit=2.25)
(choices=[apples,notebooks,nun costumes], delivery=7, profit=3.55)
both of these alternatives have to be kept, since they will affect future decisions in different ways.
Note that we're just making future decisions based on the delivery date and the profit. We keep track of the choices just so that we can report the best set of choices at the end.
Now maybe you can see the pattern. You have a set of alternatives, and whenever you have multiple alternatives that have the same delivery date, you just choose the one with the maximum profit and eliminate the others. This process of collapsing your alternatives keeps the problem from growing exponentially, allowing it to be solved efficiently.
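A hedged sketch of this pruning idea, scanning day by day and keeping only the best alternative per delivery date. It uses compounded (multiplicative) profit, so it matches the earlier graph answer's result; the bookkeeping of which items were chosen is omitted for brevity.

# (buy_day, delivery_day, profit multiplier)
offers = [(1, 2, 1.10), (1, 3, 1.15), (2, 3, 1.15), (3, 7, 1.30), (4, 7, 1.28),
          (6, 7, 1.09), (6, 8, 1.11), (7, 9, 1.16), (8, 10, 1.15), (10, 15, 1.50),
          (11, 15, 1.40), (13, 16, 1.25), (14, 19, 1.30)]

last_day = max(d for _, d, _ in offers)
best_by_day = {0: 1.0}                     # best compounded profit with last delivery on that day
for day in range(1, last_day + 1):
    # Waiting a day never loses money: carry yesterday's best forward.
    best_by_day[day] = max(best_by_day.get(day, 1.0), best_by_day[day - 1])
    for buy, delivery, gain in offers:
        if buy == day:
            candidate = best_by_day[day - 1] * gain   # must have been free strictly before `buy`
            if candidate > best_by_day.get(delivery, 0.0):
                best_by_day[delivery] = candidate
print(best_by_day[last_day])               # ~2.36992 for this data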
You can solve this as a linear programming problem. This is the standard approach to solving logistics problems, such as those faced by airlines and corporations, with much larger problem spaces than yours. I won't formally define your problem here, but in broad terms: your objective function is the maximisation of profit. You can represent the buy days and the "only one purchase at a time" rule as linear constraints.
The standard linear programming algorithm is the simplex method. Although it has exponential worst case behaviour, in practice, it tends to be very efficient on real problems. There are lots of good freely available implementations. My favourite is the GNU Linear Programming Kit. R has an interface to GLPK. Lp_solve is another well-known project, which also has an R interface. The basic approach in each case is to formally define your problem, then hand that off to the third party solver to do its thing.
To learn more, I recommend you take a look at the Algorithm Design Manual, which should give you enough background to do further research online. p.412 onwards is a thorough summary of linear programming and its variations (e.g. integer programming if you have integrality constraints).
If you've never heard of linear programming before, you might like to take a look at some examples of how it can be used. I really like this simple set of tutorial problems in Python. They include maximising profit on tins of cat food, and solving a Sudoku problem.
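A hedged sketch of how the formulation might look with PuLP (one Python modelling tool; the library choice is an assumption, not something the answer prescribes). Because compounded profit is multiplicative and therefore not linear, this sketch maximises the sum of per-item margins instead, with binary variables and pairwise no-overlap constraints; strictly speaking that makes it an integer program rather than a pure LP.

from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

# (buy_day, delivery_day, profit multiplier, item) -- a subset of the data above
offers = [(1, 2, 1.10, "Apples"), (1, 3, 1.15, "Viagra"), (3, 7, 1.30, "Nun costumes"),
          (4, 7, 1.28, "Priest costumes"), (8, 10, 1.15, "Red shoes"),
          (10, 15, 1.50, "Red Ferrari"), (11, 15, 1.40, "Yellow Ferrari")]

prob = LpProblem("purchases", LpMaximize)
x = [LpVariable(f"x_{i}", cat="Binary") for i in range(len(offers))]

# Objective: total margin (multiplier - 1), a linear stand-in for profit
prob += lpSum(x[i] * (offers[i][2] - 1.0) for i in range(len(offers)))

# No-overlap: two offers conflict unless one is delivered strictly before the other is bought
for i in range(len(offers)):
    for j in range(i + 1, len(offers)):
        bi, di = offers[i][0], offers[i][1]
        bj, dj = offers[j][0], offers[j][1]
        if bj <= di and bi <= dj:
            prob += x[i] + x[j] <= 1

prob.solve()
chosen = [offers[i][3] for i in range(len(offers)) if value(x[i]) == 1]
print(chosen)   # e.g. ['Viagra', 'Priest costumes', 'Red shoes', 'Yellow Ferrari']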
