How to represent values of stocks in a polynomial? - genetic-algorithm

I'm doing a project on genetic algorithms, and we need to build software that chooses a set of stocks based on their history.
We need to do it with genetic programming, which means we need a fitness function and a chromosome.
Right now I am thinking of defining the fitness function as the positive (absolute) difference between the average historical price of the stock and its real value (so if they match, it will be 0).
Does anyone have any idea how to express the chromosome?

The problem doesn't seem to be well-defined. The fitness function you mentioned would give you a selection of stocks whose prices hover around their actual values, provided you know the actual value of the stocks.
Other possibilities:
First scenario: You are trying to select a set of the most promising stocks based on their historical performance, i.e. maximize expected return and/or minimize variance/risk. If the number of possible stocks to choose from is not large, the simplest option is a binary string: 0 representing no selection and 1 representing selection. The position corresponds to the index of the stock. If you have a very large number of possible stocks to choose from, you can encode the labels/indices of the stocks as your chromosome. This might mean a variable-length chromosome if you do not have a maximum cap on the number of stocks to be selected, and it would be harder to code.
Fitness function (to be maximized) would be the sum of (expected return - standard deviation) of selected stocks. The expected return could be formulated in two ways: expected future price - current price, or current price - underlying value (if you know the underlying value, that is). Expected future price can be estimated from historical data (e.g. fit a simple curve of your choice, or apply ARIMA and extend to the next time points). The standard deviation can be estimated directly from historical data.
If your chromosome is binary (values are 0/1), once you have the expected return and standard deviation, a simple dot product would do the computation needed. I suppose there may be a cap also on the number of stocks selected, in which case you have a constrained optimization problem. You can represent constraints as penalties in the fitness.
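The dot-product fitness with a penalty for exceeding a cap could look roughly like the following. This is a minimal sketch, not a definitive implementation: the function name `portfolio_fitness`, the `max_stocks` cap and the `penalty` weight are assumptions made for illustration, and the expected returns and standard deviations are taken as already estimated.

```python
def portfolio_fitness(chromosome, expected_returns, std_devs,
                      max_stocks=None, penalty=1000.0):
    """Fitness of a 0/1 chromosome: sum of (expected return - std dev)
    over the selected stocks, minus a penalty if too many are selected."""
    # Per-stock score; the fitness is the dot product of the chromosome
    # with this score vector.
    scores = [r - s for r, s in zip(expected_returns, std_devs)]
    fitness = sum(bit * score for bit, score in zip(chromosome, scores))

    # Hypothetical constraint: at most `max_stocks` stocks may be selected.
    # Each stock over the cap costs `penalty` fitness points.
    if max_stocks is not None:
        excess = sum(chromosome) - max_stocks
        if excess > 0:
            fitness -= penalty * excess
    return fitness
```

The penalty weight has to be large relative to typical per-stock scores, otherwise the GA may find it worthwhile to violate the cap.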
The problem is essentially a binary integer linear program (BILP), and you can benchmark the GA against other BILP solvers. With a decent mixed integer linear programming (MILP) solver (e.g. SYMPHONY, Gurobi, IBM CPLEX, etc.), you can usually solve large problems faster than with a GA.
Second scenario: You are trying to find how many of which stocks to buy at the current price to maximize expected return. Your chromosome here would consist of non-negative integers, unless you want to represent shorting. The fitness would still be the same as in the first scenario, i.e. sum of prices of selected stocks, averaged over time, minus standard deviation of historical prices of selected stocks over time. The problem becomes an integer linear programming problem. Everything else is the same as in the first scenario. Again, if the number of stocks from which you can choose is large, you will find that a MILP solver would serve you much, much better than a GA.
Further, GP (genetic programming) is quite different from a GA.
If you are trying to evolve a stock selection strategy, or an expression that predicts future stock prices, you actually need GP. For the stock selection problem, a GA would be sufficient.

Related

Pointwise vs. pairwise Learning-to-rank on DATA WITH BINARY RELEVANCE VALUES

I have two questions about the differences between pointwise and pairwise learning-to-rank algorithms on DATA WITH BINARY RELEVANCE VALUES (0s and 1s). Suppose the loss function for a pairwise algorithm calculates the number of times an entry with label 0 gets ranked before an entry with label 1, and that for a pointwise algorithm calculates the overall differences between the estimated relevance values and the actual relevance values.
So my questions are: 1) theoretically, will the two groups of algorithms perform significantly differently? 2) will a pairwise algorithm degrade to pointwise algorithm in such settings?
thanks!
In pointwise estimation, the errors across rows in your data (rows with items and users; you want to rank items within each user/query) are assumed to be independent, somewhat like normally distributed errors. In pairwise evaluation, the loss function often used is cross-entropy: a relative measure of accurately classifying 1s as 1s and 0s as 0s within each pair (using the information that one item is better than the other within the pair).
So chances are that the pairwise approach will learn better than the pointwise one.
The only exception I can see is a business scenario in which users click items without evaluating or comparing them against one another per se. This is highly unlikely, though.
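To make the two loss definitions from the question concrete, here is a small sketch. It assumes that "ranked before" means "has a strictly higher score" and that the pointwise "overall difference" is a sum of squared errors; both are assumptions, and the function names are made up for illustration.

```python
def pairwise_loss(scores, labels):
    """Number of (label-0, label-1) pairs in which the label-0 entry
    gets a strictly higher score, i.e. is ranked before the label-1 entry."""
    mistakes = 0
    for score_neg, label_neg in zip(scores, labels):
        if label_neg != 0:
            continue
        for score_pos, label_pos in zip(scores, labels):
            if label_pos == 1 and score_neg > score_pos:
                mistakes += 1
    return mistakes


def pointwise_loss(scores, labels):
    """Sum of squared differences between estimated and actual relevance."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels))
```

Under these definitions the two losses can disagree even with binary labels: for labels [1, 0], the scores [0.2, 0.1] rank the pair correctly but have a larger squared error than the scores [0.6, 0.65], which rank it the wrong way round.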

How can I favor better offspring more than with my roulette selection method?

I am playing around with genetic programming algorithms, and I want to know how I can favor my best individuals and make sure they reproduce more, by substituting or improving the way I choose which ones will reproduce. Currently the method I use looks like this:
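function roulette(population)
    -- Spin the wheel: pick a random point along the summed fitness.
    local slice = sum_of_fitnesses(population) * math.random()
    local sum = 0
    for iter = 1, #population do
        sum = sum + population[iter].fitness
        -- Return the individual whose slice of the wheel contains the point.
        if sum >= slice then
            return population[iter]
        end
    end
    -- Fallback in case rounding leaves sum just below slice.
    return population[#population]
end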
But I can't get my population to reach an average fitness which is above a certain value and I worry it's because of less fit members reproducing with more fit members and thus continuing to spread their weak genes around.
So how can I improve my roulette selection method? Or should I use a completely different fitness proportionate selector?
There are a couple of issues at play here.
You are choosing the probability of an individual replicating based on its fitness, so the fitness function that you are using needs to exaggerate small differences or else having a minor decrease in fitness isn't so bad. For example, if a fitness drops from 81 to 80, this change is probably within the noise of the system and won't make much of a difference to evolution. It will certainly be almost impossible to climb to a very high fitness if a series of small changes need to be made because the selective pressure simply won't be strong enough.
The way you solve this problem is by using something like tournament selection. In its simplest form, every time you want to choose another individual to be born, you pick K random individuals (K is known as the "tournament size"). You calculate the fitness of each individual, and whichever has the highest fitness is replicated. It doesn't matter if the fitness difference is 81 vs 80 or if it's 10000 vs 2, since it simply takes the highest fitness.
Now the question is: what should you set K to? K can be thought of as the strength of selection. If you set it low (e.g., K=2) then many low fitness individuals will get lucky and slip through, being competed against other low-fitness individuals. You'll get a lot of diversity, but very little selection. On the flip side, if you set K to be high (say, K=100), you're ALWAYS going to pick one of the highest fitnesses in the population, ensuring that the population average is driven closer to the max, but also driving down diversity in the population.
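A tournament selector in its simplest form might look like this. It is only a sketch (in Python rather than the Lua of the question), and it assumes contestants are drawn with replacement, which the description above leaves open; the name `tournament_select` and the `rng` parameter are illustrative.

```python
import random


def tournament_select(population, fitness, k, rng=random):
    """Draw k individuals uniformly at random (with replacement)
    and return the one with the highest fitness."""
    contestants = [rng.choice(population) for _ in range(k)]
    return max(contestants, key=fitness)
```

Because only the ordering of fitness values matters, rescaling the fitness function (say, from `f` to `1000 * f + 7`) does not change which individual wins a given tournament.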
The particular tradeoff here depends on the specific problem. I recommend trying out different options (including your original algorithm) with a few different problems to see what happens. For example, try the all-ones problem: potential solutions are bit strings and a fitness is simply the number of 1's. If you have weak selection (as in your original example, or with K=2), you'll see that it never quite gets to a perfect all-ones solution.
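A small test bench for the all-ones problem could be sketched as follows. Everything here is an assumption made for illustration: the GA uses mutation only (no crossover), the parameter defaults are arbitrary, and the selectors are simplified stand-ins for the ones discussed above. No particular outcome is claimed; the point is to have something to experiment with.

```python
import random


def one_max(bits):
    """All-ones problem: fitness is simply the number of 1s."""
    return sum(bits)


def roulette(population, fitness, rng):
    """Fitness-proportionate selection."""
    weights = [fitness(ind) for ind in population]
    if sum(weights) == 0:          # all-zero population: fall back to uniform
        return rng.choice(population)
    return rng.choices(population, weights=weights, k=1)[0]


def tournament(k):
    """Build a tournament selector with tournament size k."""
    def select(population, fitness, rng):
        return max((rng.choice(population) for _ in range(k)), key=fitness)
    return select


def evolve(select, length=30, pop_size=40, generations=60,
           mutation_rate=0.02, seed=0):
    """Run a mutation-only GA and return the best fitness in the final population."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        for _ in range(pop_size):
            parent = select(population, one_max, rng)
            # Flip each bit independently with probability mutation_rate.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in parent]
            offspring.append(child)
        population = offspring
    return max(one_max(ind) for ind in population)
```

Calling `evolve(roulette)`, `evolve(tournament(2))` and `evolve(tournament(7))` with several seeds gives a rough feel for how selection strength affects the final fitness.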
So, why not always use a high K? Well consider a problem where ones are negative unless they appear in a block of four consecutive ones (or eight, or however many), when they suddenly become very positive. Such a problem is "deceptive", which means that you need to explore through solutions that look bad in order to find ones that are good. If you set your strength of selection too high, you'll never collect three ones for that final mutation to give you the fourth.
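One possible reading of such a deceptive fitness function, as a sketch: every 1 costs a point unless it belongs to a run of at least `block` consecutive ones, in which case each 1 in that run earns `reward` points. The exact scoring and the parameter names are assumptions; the text above only describes the general shape.

```python
def deceptive_fitness(bits, block=4, reward=10):
    """Ones are penalized unless they sit in a run of at least `block` ones."""
    total = 0
    run = 0
    # The appended 0 closes a run that reaches the end of the string.
    for bit in list(bits) + [0]:
        if bit == 1:
            run += 1
        else:
            if run >= block:
                total += reward * run   # long enough run: strongly positive
            else:
                total -= run            # short run: each 1 costs a point
            run = 0
    return total
```

With these defaults, three ones in a row score worse than no ones at all, so strong selection tends to remove exactly the individuals that are one mutation away from a rewarded block.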
Many more advanced techniques build on tournament selection and may be worth a look: for example, varying K over time, or even varying it within a population, selecting some individuals using a low K and others using a high K. It's worth reading up on these if you're planning to build a better algorithm.

Why do you need fitness scaling in Genetic Algorithms?

Reading the book "Genetic Algorithms" by David E. Goldberg, he mentions fitness scaling in Genetic Algorithms.
My understanding of this function is to constrain the strongest candidates so that they don't flood the pool for reproduction.
Why would you want to constrain the best candidates? In my mind having as many of the best candidates as early as possible would help get to the optimal solution as fast as possible.
What if your early best candidates later on turn out to be evolutionary dead ends? Say, your early fittest candidates are big, strong agents that dominate smaller, weaker candidates. If all the weaker ones are eliminated, you're stuck with large beasts that maybe have a weakness to an aspect of the environment that hasn't been encountered yet that the weak ones can handle: think dinosaurs vs tiny mammals after an asteroid impact. Or, in a more deterministic setting that is more likely the case in a GA, the weaker candidates may be one or a small number of evolutionary steps away from exploring a whole new fruitful part of the fitness landscape: imagine the weak small critters evolving flight, opening up a whole new world of possibilities that the big beasts most likely will never touch.
The underlying problem is that your early strongest candidates may actually be in or around a local maximum in fitness space, that may be difficult to come out of. It could be that the weaker candidates are actually closer to the global maximum.
In any case, by pruning your population aggressively, you reduce the genetic diversity of your population, which in general reduces the search space you are covering and limits how fast you can search this space. For instance, maybe your best candidates are relatively close to the global best solution, but just inbreeding that group may not move it much closer to it, and you may have to wait for enough random positive mutations to happen. However, perhaps one of the weak candidates that you wanted to cut out has some gene that on its own doesn't help much, but when crossed with the genes from your strong candidates it may cause a big evolutionary jump! Imagine, say, a human crossed with spider DNA.
@sgvd's answer makes valid points but I would like to elaborate more.
First of all, we need to define what fitness scaling actually means. If it means just multiplying the fitnesses by some factor then this does not change the relationships in the population - if the best individual had 10 times higher fitness than the worst one, after such multiplication this is still true (unless you multiply by zero which makes no real sense). So, a much more sensible fitness scaling is an affine transformation of the fitness values:
scaled(f) = a * f + b
i.e. the values are multiplied by some number and offset by another number, up or down.
Fitness scaling makes sense only with certain types of selection strategies, namely those where the selection probability is proportional to the fitness of the individuals [1].
Fitness scaling plays, in fact, two roles. The first one is merely practical - if you want a probability to be proportional to the fitness, you need the fitness to be positive. So, if your raw fitness value can be negative (but is limited from below), you can adjust it so you can compute probabilities out of it. Example: if your fitness gives values from the range [-10, 10], you can just add 10 to the values to get all positive values.
The second role is, as you and @sgvd already mentioned, to limit the capability of the strongest solutions to overwhelm the weaker ones. The best illustration would be with an example.
Suppose that your raw fitness values give values from the range [0, 100]. If you left it this way, the worst individuals would have zero probability of being selected, and the best ones would have up to 100x higher probability than the worst ones (excluding the really worst ones). However, let's set the scaling factors to a = 1/2, b = 50. Then, the range is transformed to [50, 100]. And right away, two things happen:
- Even the worst individuals have non-zero probability of being selected.
- The best individuals are now only 2x more likely to be selected than the worst ones.
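The effect of the affine scaling on roulette-wheel probabilities can be checked with a few lines. This is a sketch; the function name `selection_probabilities` is made up, and the three raw fitness values used in the usage note are just sample points from the [0, 100] range of the example.

```python
def selection_probabilities(fitnesses, a=1.0, b=0.0):
    """Roulette-wheel probabilities after the scaling scaled(f) = a * f + b."""
    scaled = [a * f + b for f in fitnesses]
    total = sum(scaled)
    if min(scaled) < 0 or total == 0:
        raise ValueError("scaled fitness must be non-negative and not all zero")
    return [s / total for s in scaled]
```

For raw fitnesses 0, 50 and 100, the unscaled probabilities are 0, 1/3 and 2/3. With a = 1/2 and b = 50 the scaled values are 50, 75 and 100, so the probabilities become 50/225, 75/225 and 100/225: the worst individual can now be selected, and the best is exactly twice as likely as the worst.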
Exploration vs. exploitation
By setting the scaling factors you can control whether the algorithm will do more exploration over exploitation and vice versa. The more "compressed" [2] the values are going to be after the scaling, the more exploration is going to be done (because the likelihood of the best individuals being selected compared to the worst ones will be decreased). And vice versa, the more "expanded" [2] the values are going to be, the more exploitation is going to be done (because the likelihood of the best individuals being selected compared to the worst ones will be increased).
Other selection strategies
As I have already written at the beginning, fitness scaling only makes sense with selection strategies which derive the selection probability proportionally from the fitness values. There are, however, other selection strategies that do not work like this.
Ranking selection
Ranking selection is identical to roulette wheel selection but the numbers the probabilities are derived from are not the raw fitness values. Instead, the whole population is sorted by the raw fitness values and the rank (i.e. the position in the sorted list) is the number you derive the selection probability from.
This totally erases the discrepancy when there is one or two "big" individuals and a lot of "small" ones. They will just be ranked.
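Ranking selection could be sketched like this. It assumes linear ranking, with the worst individual getting rank 1 and the best getting rank n, and the selection probability proportional to the rank; other rank-to-probability mappings exist, and the function names are illustrative.

```python
import random


def rank_probabilities(fitnesses):
    """Selection probabilities proportional to rank (worst = 1, best = n)."""
    order = sorted(range(len(fitnesses)), key=lambda i: fitnesses[i])
    ranks = [0] * len(fitnesses)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    total = sum(ranks)
    return [r / total for r in ranks]


def rank_select(population, fitnesses, rng=random):
    """Roulette-wheel selection over ranks instead of raw fitness values."""
    weights = rank_probabilities(fitnesses)
    return rng.choices(population, weights=weights, k=1)[0]
```

For raw fitnesses 1, 1000 and 2 the ranks are 1, 3 and 2, so the "big" individual is selected with probability 3/6 rather than the roughly 99.7% it would get from its raw fitness.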
Tournament selection
In this type of selection you don't even need to know the absolute fitness values at all, you just need to be able to compare two of them and tell which one is better. To select one individual using tournament selection, you randomly pick a number of individuals from the population (this number is a parameter) and you pick the best one of them. You repeat that until you have selected enough individuals.
Here you can also control the exploration vs. exploitation thing by the size of the tournament - the larger the tournament is the higher is the chance that the best individuals will take part in the tournaments.
[1] An example of such a selection strategy is the classical roulette wheel selection. In this selection strategy, each individual has its own section of the roulette wheel which is proportional in size to the particular individual's fitness.
[2] Assuming the raw values are positive, the scaled values get compressed as a goes down to zero and as b goes up. Expansion goes the other way around.

Normalization of a multi-dimensional space, what algorithm is this?

I'm not a trained statistician so I apologize for the incorrect usage of some words. I'm just trying to get some good results from the Weka Nearest Neighbor algorithms. I'll use some redundancy in my explanation as a means to try to get the concept across:
Is there a way to normalize a multi-dimensional space so that the distances between any two instances are always proportional to the effect on the dependent variable?
In other words, I have a statistical data set and I want to use a "nearest neighbor" algorithm to find instances that are most similar to a specified test instance. Unfortunately my initial results are useless because two attributes that are very close in value but only weakly correlated with the dependent variable incorrectly bias the distance calculation.
For example let's say you're trying to find the nearest-neighbor of a given car based on a database of cars: make, model, year, color, engine size, number of doors. We know intuitively that the make, model, and year have a bigger effect on price than the number of doors. So a car with identical color, door count, may not be the nearest neighbor to a car with different color/doors but same make/model/year. What algorithm(s) can be used to appropriately set the weights of each independent variable in the Nearest Neighbor distance calculation so that the distance will be statistically proportional (correlated, whatever) to the dependent variable?
Application: This can be used for a more accurate "show me products similar to this other product" on shopping websites. Back to the car example, this would have cars of same make and model bubbling up to the top, with year used as a tie-breaker, and then within cars of the same year, it might sort the ones with the same number of cylinders (4 or 6) ahead of the ones with the same number of doors (2 or 4). I'm looking for an algorithmic way to derive something similar to the weights that I know intuitively (make >> model >> year >> engine >> doors) and actually assign numerical values to them to be used in the nearest-neighbor search for similar cars.
A more specific example:
Data set:
Blue,Honda,6-cylinder
Green,Toyota,4-cylinder
Blue,BMW,4-cylinder
now find cars similar to:
Blue,Honda,4-cylinder
In this limited example, it would match the Green,Toyota,4-cylinder ahead of the Blue,Honda,6-cylinder because the two brands are statistically almost interchangeable and cylinder count is a stronger determinant of price than color. BMW would match lower because that brand tends to double the price, i.e. placing the item at a larger distance.
Final note: the prices are available during training of the algorithm, but not during calculation.
You could possibly look at Solr/Lucene for this purpose. Solr provides similarity search based on field value frequency, and it already has MoreLikeThis functionality for finding similar items.
Maybe nearest neighbor is not a good algorithm for this case? Since you want to classify discrete values, it can become quite hard to define reasonable distances. I think a C4.5-like algorithm may better suit the application you describe. At each step the algorithm optimizes an entropy-based criterion (information gain, or gain ratio in C4.5 itself), so you always select the feature that gives you the most information.
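As a rough illustration of the entropy criterion, the information gain of each attribute with respect to a price class can be computed as below. This is a sketch under assumptions: prices are taken to be binned into discrete classes beforehand, the function names are made up, and whether such gains make good attribute weights for a nearest-neighbor distance is not something the answer above claims.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in label entropy obtained by splitting on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

An attribute whose values fully determine the price class gets a gain equal to the entropy of the labels, while an attribute that is independent of the price class gets a gain of zero.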
Found something in the IEEE website. The algorithm is called DKNDAW ("dynamic k-nearest-neighbor with distance and attribute weighted"). I couldn't locate the actual paper (probably needs a paid subscription). This looks very promising assuming that the attribute weights are computed by the algorithm itself.

Subset generation by rules

Let's say that we have 5000 users in a database. Each user row has a sex column, a column for the place where he/she was born, and a status (married or not married) column.
How to generate a random subset (let's say 100 users) that would satisfy these conditions:
40% should be males and 60% - females
50% should be born in USA, 20% born in UK, 20% born in Canada, 10% in Australia
70% should be married and 30% not.
These conditions are independent, that is, we cannot do it like this:
(0.4 * 0.5 * 0.7) * 100 = 14 users that are males, born in USA and married
(0.4 * 0.5 * 0.3) * 100 = 6 users that are males, born in USA and not married.
Is there an algorithm for this kind of generation?
Does the breakdown need to be exact, or approximate? Typically if you are generating a sample like this then you are doing some statistical study, so it is sufficient to generate an approximate sample.
Here's how to do this:
Have a function genRandomIndividual().
Each time you generate an individual, use the random function to choose the sex - male with probability 40%
Choose birth location using the random function again (just generate a real number in the interval 0-1; if it falls in 0-.5, choose USA; if .5-.7, then UK; if .7-.9, then Canada; otherwise Australia).
Choose married status using random function (again generate in 0-1, if 0-.7 then married, otherwise not).
Once you have a set of characteristics, search in the database for the first individual who satisfies these characteristics, add them to your sample, and tag it as already added in the database. Keep doing this until you have fulfilled your sample size.
There may be no individual that satisfies the characteristics. Then, just generate a new random individual instead. Since the generations are independent and generate the characteristics according to the required probabilities, in the end you will have a sample size of the correct size with the individuals generated randomly according to the probabilities specified.
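The steps above can be sketched as follows. The in-memory list of dictionaries stands in for the database, and the column names, the `TARGETS` table and the `max_draws` safety limit are assumptions for illustration; the result matches the proportions only approximately, as discussed.

```python
import random

# Target shares from the question, one distribution per column.
TARGETS = {
    "sex": {"male": 0.4, "female": 0.6},
    "country": {"USA": 0.5, "UK": 0.2, "Canada": 0.2, "Australia": 0.1},
    "married": {True: 0.7, False: 0.3},
}


def gen_random_individual(rng):
    """Draw one value per column according to the target shares."""
    return {
        column: rng.choices(list(shares), weights=list(shares.values()), k=1)[0]
        for column, shares in TARGETS.items()
    }


def approximate_sample(users, size, rng, max_draws=100_000):
    """Draw random profiles until `size` matching users have been found."""
    available = list(users)   # copy, so the caller's list is left untouched
    sample = []
    draws = 0
    while len(sample) < size and draws < max_draws:
        draws += 1
        wanted = gen_random_individual(rng)
        # First not-yet-used user with exactly this profile, if any.
        index = next(
            (i for i, user in enumerate(available)
             if all(user[col] == value for col, value in wanted.items())),
            None,
        )
        if index is not None:
            sample.append(available.pop(index))
        # Otherwise no such user is left: simply draw a new profile.
    return sample
```

Removing the user from `available` plays the role of tagging the row as already added.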
You could try something like this:
Pick a random initial set of 100
Until you have the right distribution (or give up):
Pick a random record not in the set, and a random one that is
If swapping in the other record gets you closer to the set you want, exchange them. Otherwise, don't.
I'd probably use the sum of squares of the distances to the desired distribution as the metric for deciding whether to swap.
That's what comes to mind that keeps the set random. Keep in mind that there may be no subset which matches the distribution you're after.
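A sketch of this swap-based search, using the sum of squared differences between actual and desired shares as the metric. The function names, the `max_steps` limit used as the "give up" condition, and the `targets` layout (column name mapped to a dictionary of value shares) are assumptions for illustration.

```python
import random


def distribution_error(sample, targets):
    """Sum of squared differences between actual and desired shares."""
    error = 0.0
    for column, shares in targets.items():
        for value, share in shares.items():
            actual = sum(1 for u in sample if u[column] == value) / len(sample)
            error += (actual - share) ** 2
    return error


def swap_search(users, size, targets, rng, max_steps=5000):
    """Start from a random subset and keep only swaps that reduce the error."""
    pool = list(users)
    rng.shuffle(pool)
    chosen, rest = pool[:size], pool[size:]
    error = distribution_error(chosen, targets)
    for _ in range(max_steps):
        if error < 1e-12 or not rest:
            break
        i, j = rng.randrange(size), rng.randrange(len(rest))
        chosen[i], rest[j] = rest[j], chosen[i]        # tentative swap
        new_error = distribution_error(chosen, targets)
        if new_error < error:
            error = new_error                          # closer: keep the swap
        else:
            chosen[i], rest[j] = rest[j], chosen[i]    # not closer: undo it
    return chosen, error
```

A returned error close to zero means the subset matches the requested distribution; a larger value means the search gave up before getting there.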
It is important to note that you may not be able to find a subset that satisfies these conditions. To take an example, suppose your database contained only American males, and only Australian females. Clearly you could not generate any subset that satisfies your distribution constraints.
(Rewrote my post completely (actually, wrote a new one and deleted the old) because I thought of a much simpler and more efficient way to do the same thing.)
I'm assuming you actually want the exact proportions and not just to satisfy them on average. This is a pretty simple way to accomplish that, but depending on your data it might take a while to run.
First, arrange your original data so that you can access each combination of types easily, that is, group married US men in one pile, unmarried US men in another, and so on. Then, assuming that you have p conditions and you want to select k elements, make p arrays of size k each; one array will represent one condition. Make the elements of each array be the types of that condition, in the proportions that you require. So, in your example, the gender array would have 40 males and 60 females.
Now, shuffle each of the p arrays independently (actually, you can leave one array unshuffled if you like). Then, for each index i, take the type of the picked element to be the combination from the shuffled p arrays at index i, and pick one such type at random from the remaining ones in your original group, removing the picked element. If there are no elements of that type left, the algorithm has failed, so reshuffle the arrays and start again to pick elements.
To use this, you need to first make sure that the conditions are satisfiable at all because otherwise it will just loop infinitely. To be honest, I don't see a simple way to verify that the conditions are satisfiable, but if the number of elements in your original data is large compared to k and their distribution isn't too skewed, there should be solutions. Also, if there are only a few ways in which the conditions can be satisfied, it might take a long time to find one; though the method will terminate with probability 1, there is no upper bound that you can place on the running time.
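The shuffle-based method could be sketched like this. The `quotas` layout (column name mapped to a dictionary of exact counts, each column summing to k) and the `max_attempts` limit are assumptions added for illustration; the limit replaces the potentially infinite loop mentioned above.

```python
import random
from collections import defaultdict


def shuffled_column(counts, rng):
    """One array of size k holding each value as often as its quota, shuffled."""
    column = [value for value, n in counts.items() for _ in range(n)]
    rng.shuffle(column)
    return column


def exact_sample(users, quotas, rng, max_attempts=1000):
    """Return a subset with exactly the requested counts, or None on failure."""
    columns = list(quotas)
    for _ in range(max_attempts):
        # Group users into piles by their combination of column values.
        piles = defaultdict(list)
        for user in users:
            piles[tuple(user[c] for c in columns)].append(user)
        for pile in piles.values():
            rng.shuffle(pile)          # popping then picks a random member
        arrays = [shuffled_column(quotas[c], rng) for c in columns]
        sample = []
        for combo in zip(*arrays):     # index i of every shuffled array
            if not piles[combo]:
                break                  # nobody of this type left: reshuffle
            sample.append(piles[combo].pop())
        else:
            return sample              # every index was served
    return None
```

For the question's example with k = 100, the quotas would be 40/60 for sex, 50/20/20/10 for birthplace and 70/30 for marital status.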
Algorithm may be too strong a word, since to me that implies formalism and publication, but there is a method to select subsets with exact proportions (assuming your percentages yield whole numbers of subjects from the sample universe), and it's much simpler than the other proposed solutions. I've built one and tested it.
Incidentally, I'm sorry to be a slow responder here, but my time is constrained these days. I wrote a hard-coded solution fairly quickly, and since then I've been refactoring it into a decent general-purpose implementation. Because I've been busy, that's still not complete yet, but I didn't want to delay answering any longer.
The method:
Basically, you're going to consider each row separately, and decide whether it's selectable based on whether your criteria give you room to select each of its column values.
In order to do that, you'll consider each of your column rules (e.g., 40% males, 60% females) as an individual target (e.g., given a desired subset size of 100, you're looking for 40 males, 60 females). Make a counter for each.
Then you loop, until you've either created your subset, or you've examined all the rows in the sample universe without finding a match (see below for what happens then). This is the loop in pseudocode:
- Randomly select a row.
- Mark the row examined.
- For each column constraint:
    * Get the value for the relevant column from the row
    * Test for selectability:
        If there's a value target for the value,
        and if we haven't already selected our target number of incidences of this value,
        then the row is selectable with respect to this column
    * Else: the row fails.
- If the row didn't fail, select it: add it to the subset
That's the core of it. It will provide a subset which matches your rules, or it will fail to do so... which brings me to what happens when we can't find a match.
Unsatisfiability:
As others have pointed out, it's not always possible to satisfy any arbitrary set of rules for any arbitrary sample universe. Even assuming that the rules are valid (percentages for each value sum to 100), the subset size is less than the universe size, and the universe does contain enough individuals with each selected value to hit the targets, it's still possible to fail if the values are actually not distributed independently.
Consider the case where all the males in the sample universe are Australian: in this case, you can only select as many males as you can select Australians, and vice-versa. So a set of constraints (subset size: 100; male: 40%; Australian 10%) cannot be satisfied at all from such a universe, even if all the Australians we select are male.
If we change the constraints (subset size: 100; male: 40%; Australian 40%), now we can possibly make a matching subset, but all of the Australians we select must be male. And if we change the constraints again (subset size: 100; male: 20%; Australian 40%), now we can possibly make a matching subset, but only if we don't pick too many Australian women (no more than half in this case).
In this latter case, selection order is going to matter. Depending on our random seed, sometimes we might succeed, and sometimes we might fail.
For this reason, the algorithm must (and my implementation does) be prepared to retry. I think of this as a patience test: the question is how many times are we willing to let it fail before we decide that the constraints are not compatible with the sample population.
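The selection loop with retries might be sketched as follows. This is not the author's implementation, which is not shown in the post; the `quotas` layout (column name mapped to exact target counts), the function names and the default number of retries are assumptions. Visiting a shuffled copy of the rows stands in for "randomly select a row, mark it examined".

```python
import random


def try_once(users, quotas, rng):
    """One pass over the rows in random order; returns a subset or None."""
    # Remaining room per column value; starts at the target counts.
    remaining = {col: dict(counts) for col, counts in quotas.items()}
    size = sum(next(iter(quotas.values())).values())
    order = list(users)
    rng.shuffle(order)
    subset = []
    for row in order:
        # Selectable only if every column still has room for this row's value.
        if all(remaining[col].get(row[col], 0) > 0 for col in quotas):
            for col in quotas:
                remaining[col][row[col]] -= 1
            subset.append(row)
            if len(subset) == size:
                return subset
    return None   # all rows examined without filling the subset


def select_subset(users, quotas, rng, retries=20):
    """Retry the pass a limited number of times (the 'patience test')."""
    for _ in range(retries):
        subset = try_once(users, quotas, rng)
        if subset is not None:
            return subset
    return None
```

A `None` result means the patience ran out, not that a matching subset is proven impossible.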
Suitability
This method is well suited to the OP's task as described: selecting a random subset which matches given criteria. It is not suitable to answering a slightly different question: "is it possible to form a subset with the given criteria".
My reasoning for this is simple: the situations in which the algorithm fails to find a subset are those in which the data contains unknown linkages, or where the criteria allow a very limited number of subsets from the sample universe. In these cases, the use of any subset would be questionable for statistical analysis, at least not without further thought.
But for the purpose of answering the question of whether it's possible to form a subset, this method is non-deterministic and inefficient. It would be better to use one of the more complex shuffle-and-sort algorithms proposed by others.
Pre-Validation:
The immediate thought upon discovering that not all subsets can be satisfied is to perform some initial validation, and perhaps to analyze the data to see whether it's answerable or only conditionally answerable.
My position is that other than initially validating that each of the column rules is valid (i.e., the column percentages sum to 100, or near enough) and that the subset size is less than the universe size, there's no other prior validation which is worth doing. An argument can be made that you might want to check that the universe contains enough individuals with each selected value (e.g., that there actually are 40 males and 60 females in the universe), but I haven't implemented that.
Other than those, any analysis to identify linkages in the population is itself so time-consuming that you might be better served just running the thing with more retries. Maybe that's just my lack of statistics background talking.
Not quite the subset sum problem
It has been suggested that this problem is like the subset sum problem. I contend that this is subtly and yet significantly different. My reasoning is as follows: for the subset sum problem, you must form and test a subset in order to answer the question of whether it meets the rules: it is not possible (except in certain edge conditions) to test an individual element before adding it to the subset.
For the OP's question, though, it is possible. As I'll explain, we can randomly select rows and test them individually, because each has a weight of one.
