If I have a set of randomly generated numbers (integers), how do I find a relationship between them so that I can express them as a finite sequence, and develop an algorithm that can generate any nth term of the sequence given some seed data?
Is there an existing algorithm, framework, or library that does this? If there isn't, any suggestions on how to proceed?
Thanks.
It depends on the algorithm used to generate the (pseudo-random) numbers. If you want to predict future terms, you need some number of past terms and the algorithm used. If the algorithm is cryptographically secure, then you are out of luck. If it isn't, then you have a good chance of working out future terms.
I asked a question about linear congruential generators (commonly used for simple applications) a while ago. It gives a pretty good discussion of how to predict terms for that class of generator.
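To make the idea concrete, here is a minimal sketch for a plain linear congruential generator. It assumes the modulus m is known and that you have observed three consecutive outputs; the parameter values are illustrative, not taken from any particular generator in the question.

```python
# Sketch: predicting a linear congruential generator (LCG)
#   x_{n+1} = (a * x_n + c) % m
# assuming m is known and three consecutive outputs have been observed.
# The parameter values below are illustrative only.

def recover_lcg_params(x0, x1, x2, m):
    """Recover (a, c) from three consecutive outputs, given m.
    Requires (x1 - x0) to be invertible modulo m."""
    a = ((x2 - x1) * pow(x1 - x0, -1, m)) % m
    c = (x1 - a * x0) % m
    return a, c

def next_term(x, a, c, m):
    return (a * x + c) % m

# Made-up example parameters (m chosen prime so the inverse exists):
m, a, c = 2**31 - 1, 1103515245, 12345
xs = [42]
for _ in range(3):
    xs.append(next_term(xs[-1], a, c, m))

a_rec, c_rec = recover_lcg_params(xs[0], xs[1], xs[2], m)
print(next_term(xs[2], a_rec, c_rec, m) == xs[3])  # True
```

If the modulus is unknown too, it can often be recovered from more outputs, but that is beyond this sketch.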
Assume I have an array of Twitter users and their followers, and I want to identify the 5 users with the most unique followers between them, so that if I ask them to retweet an advertisement for my product, it reaches the largest number of users.
I do not have formal programming or computer science training. However I do understand algorithms and basic CS concepts. It would be great if a solution can be provided in a way a layman could follow.
This is the "Maximum coverage problem", which is a class of problems thought to be difficult to solve efficiently (so-called NP-hard problems). You can read about the problem on wikipedia: https://en.wikipedia.org/wiki/Maximum_coverage_problem
A simple algorithm to solve it is to enumerate all subsets of size 5 of your friends and measure the size of the union of their followers. This is not an efficient solution: if you have n friends, there are C(n, 5), roughly n^5 / 120, subsets of size 5 to check (assuming n is large).
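For completeness, the brute-force enumeration might look like the sketch below; the follower data is made up, and this is only practical for small n.

```python
from itertools import combinations

# Brute-force maximum coverage: try every subset of 5 users and keep
# the one whose combined follower set is largest. Feasible only for
# small n, since there are C(n, 5) subsets. The data is illustrative.
followers = {
    "alice": {1, 2, 3},
    "bob":   {3, 4},
    "carol": {5, 6, 7},
    "dave":  {1, 7},
    "erin":  {8},
    "frank": {2, 9},
}

best_team, best_reach = None, -1
for team in combinations(followers, 5):
    reach = len(set().union(*(followers[u] for u in team)))
    if reach > best_reach:
        best_team, best_reach = team, reach

print(best_team, best_reach)
```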
If you want a solution that is feasible to code and may be reasonably efficient in real-world cases, you might look at representing the problem as an "integer linear program" (ILP) and using a solver such as GLPK. The details of how to represent max-coverage as an ILP are given on the Wikipedia page. Getting it working will require some effort, though, and it may still not perform well if your problem is large.
For the basic genetic algorithm implementation, with a random crossover boundary and a random number of mutations at random bit positions, a lot of inferior children are created, leaving the optimum solution to be discovered by chance. This wastes a lot of CPU, and the user never knows when the optimum solution has been found, because it could always be "the next one".
Is there an algorithm to consistently get better children rather than leave this important process to chance?
Thank you.
As others have said, the quality of offspring depends on a lot of factors and often requires experimentation (using known solutions) to get right.
However, one of the biggest factors in determining the quality of the children is the selection of the parent chromosomes. Since stronger parents are more likely to create strong children, the type of selection plays a big part.
Which type of selection is best (the more common types are rank-based, roulette-wheel and tournament selection) is, like most things related to genetic algorithms, largely dependent on the problem, and can often require experimentation to get right.
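As a concrete illustration of one of these schemes, here is a minimal tournament-selection sketch; the population and fitness function are placeholders, and the tournament size is an arbitrary choice.

```python
import random

# Minimal tournament selection sketch: pick k random candidates and
# return the fittest. A larger tournament size gives stronger
# selection pressure. The population and fitness are placeholders.
def tournament_select(population, fitness, k=3):
    contenders = random.sample(population, k)
    return max(contenders, key=fitness)

# Example: bitstring chromosomes, fitness = number of 1 bits.
population = [[random.randint(0, 1) for _ in range(16)] for _ in range(20)]
parent_a = tournament_select(population, fitness=sum)
parent_b = tournament_select(population, fitness=sum)
```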
As to whether there is a better crossover/mutation algorithm for the basic genetic algorithm, the answer is: not really. You can experiment with different kinds of crossover (one-point, two-point, n-point) and mutation (swap or replace), and the parameters for each can also be altered. There are also plenty of things you can change or add to the genetic algorithm to improve efficiency (culling, duplicate removal, allowing the best chromosome into the next generation), but then your genetic algorithm would no longer be a basic one. Adding these features also means you may have to do a lot more experimentation to get the features, and their parameters, right.
As Michalewicz states in his book How to Solve It: Modern Heuristics, there is no such thing as an off-the-shelf genetic algorithm. So the answer to your question is basically what @OnABauer stated.
I would only like to complete his answer with a suggestion that you look into memetic algorithms (there is an interesting introduction here). If you add a local optimization operator, chances are that offspring will be improved (just beware of getting trapped in local optima).
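A rough sketch of that idea, under the assumption of bitstring chromosomes: after crossover/mutation, run a short bit-flip hill-climb on each child before inserting it into the population. The local optimizer and fitness used here are illustrative choices, not the one-and-only way to do it.

```python
import random

# Memetic-style local improvement (illustrative): after producing a
# child, apply a short bit-flip hill-climb so each offspring is at
# least a local optimum with respect to single-bit changes.
def local_improve(child, fitness, max_passes=2):
    best = list(child)
    for _ in range(max_passes):
        improved = False
        for i in range(len(best)):
            candidate = best[:]
            candidate[i] ^= 1            # flip one bit
            if fitness(candidate) > fitness(best):
                best, improved = candidate, True
        if not improved:
            break
    return best

# Example: maximize the number of 1 bits.
child = [random.randint(0, 1) for _ in range(12)]
print(local_improve(child, fitness=sum))
```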
For optimization problems like the traveling salesperson, you can encode the solution so that all possible crossovers produce a valid solution.
For example, instead of treating the genome as a list of cities (and thereby making every genome that misses a city or revisits a city as invalid), you can treat the genome as a list of transformations on a list of cities, starting with some (arbitrary) canonical list of cities.
Suppose we have a list of cities:
Azusa
Boca Raton
Cincinnati
Denver
If you treat each pair of bits as an encoding of one of the cities, then only a small fraction of bit patterns encodes a valid tour (with four cities and two bits per city, only the 4! = 24 permutations out of 2^8 = 256 possible genomes are valid). Mutating or crossing valid tours therefore has a very small probability of producing another valid tour.
If you instead treat every four bits as a swap instruction, then any list of bits is valid. To determine the tour it represents, you start with an "official" ordering of the cities and apply the swaps in order. You end up with a valid tour, even if some of the swaps are no-ops.
I've used this approach in a couple of optimization problems with good results.
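A small sketch of the decoding step described above, assuming four cities (so two bits per index, four bits per swap); the genome and canonical city list are illustrative.

```python
# Decoding a "list of swaps" genome (illustrative sketch, 4 cities):
# every 4 bits encode two 2-bit indices; the corresponding cities in
# the canonical list are swapped. Any bitstring decodes to a valid tour.
CANONICAL = ["Azusa", "Boca Raton", "Cincinnati", "Denver"]

def decode_tour(bits):
    tour = list(CANONICAL)
    for k in range(0, len(bits) - 3, 4):
        i = bits[k] * 2 + bits[k + 1]
        j = bits[k + 2] * 2 + bits[k + 3]
        tour[i], tour[j] = tour[j], tour[i]   # a no-op when i == j
    return tour

genome = [1, 0, 0, 1,   0, 0, 1, 1]   # swap(2, 1) then swap(0, 3)
print(decode_tour(genome))
```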
In essence, a genetic algorithm is a type of search algorithm.
GA is a particular kind of heuristic search.
You explore first the answers that you think are more likely to be the best.
In a GA, you choose to explore an answer because it is similar to previously known good answers (the parents).
A GA also traditionally terminates before exploring all the possible answers, which I think is the aspect that worries you the most.
If you want to always look at all possible answers, then you are considering an exhaustive search, for example a depth-first search through all possible answers.
In conclusion, GA is a heuristic search.
You choose it, if:
exhaustive search isn't fast enough.
you don't care if the final result is the best (globally optimal)
you understand how to guess better answers based on answers already explored. This depends on the problem domain, and it is what determines the choice of mutation and crossover operators.
I have been learning about genetic algorithms for two months. I know about the process of initial population creation, selection, crossover, mutation, etc. But I cannot understand how we get better results in each generation and how this differs from a random search for the best solution. I will use one example below to explain my problem.
Let's take the example of the travelling salesman problem. Say we have several cities X1, X2, ..., X18 and we have to find the shortest path through them. When we do the crossover after selecting the fittest individuals, how do we know that the crossover will produce a better chromosome? The same applies to mutation.
I feel like this is just: take one arrangement of cities, calculate the distance of that tour, and store the arrangement and its distance. Then choose another arrangement/combination; if it is better than the previous one, save the current arrangement and distance, otherwise discard it. By doing only this, we will also arrive at some solution.
I just want to know where the difference between random search and a genetic algorithm comes in. In a genetic algorithm, is there any mechanism that prevents us from selecting an arrangement/combination of cities that we have already evaluated?
I am not sure if my question is clear; please let me know if it isn't, and I can explain more.
A random algorithm starts with a completely blank sheet every time. A new random solution is generated each iteration, with no memory of what happened before during the previous iterations.
A genetic algorithm has a history, so it does not start with a blank sheet, except at the very beginning. Each generation the best of the solution population are selected, mutated in some way, and advanced to the next generation. The least good members of the population are dropped.
Genetic algorithms build on previous success, so they are able to advance faster than random algorithms. A classic example of a very simple genetic algorithm is the Weasel program. It finds its target far more quickly than random chance because each generation starts from a partial solution, and over time those partial solutions get closer to the required one.
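A compact version of the Weasel program mentioned above is sketched below; the mutation rate and brood size are arbitrary illustrative choices.

```python
import random
import string

# Compact Weasel-program sketch: each generation, make many mutated
# copies of the current best string and keep the closest match to the
# target. Mutation rate and brood size are arbitrary choices.
TARGET = "METHINKS IT IS LIKE A WEASEL"
ALPHABET = string.ascii_uppercase + " "

def score(candidate):
    return sum(a == b for a, b in zip(candidate, TARGET))

def mutate(parent, rate=0.05):
    return "".join(random.choice(ALPHABET) if random.random() < rate else ch
                   for ch in parent)

current = "".join(random.choice(ALPHABET) for _ in range(len(TARGET)))
generation = 0
while current != TARGET:
    generation += 1
    brood = [mutate(current) for _ in range(100)]
    current = max(brood + [current], key=score)  # keep parent if no child improves
print(generation, current)
```

Compare this with a pure random search, which would regenerate the whole string from scratch every iteration and would take astronomically longer.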
I think there are two things you are asking about: a mathematical proof that GAs work, and an empirical one that would allay your concerns.
Although I am not aware of a general proof, I am quite sure that at least a good sketch of one was given by John Holland in his book Adaptation in Natural and Artificial Systems for optimization problems using binary coding; it is known as Holland's schema theorem. (But remember, a GA is a heuristic, so technically it does not have to come with a proof.) The theorem basically says that short schemata in the genotype that raise the average fitness spread exponentially over successive generations, and crossover then combines them. I think the proof was given only for binary coding, and it has attracted some criticism as well.
Regarding your concerns: of course you have no guarantee that a crossover will produce a better result, just as two intelligent or beautiful parents might have an ugly, stupid child. The premise of a GA is that this is less likely to happen. As I understand it, the proof for binary coding hinges on the theorem saying that good partial patterns will start emerging, and, given that the genotype is long enough, such patterns residing in different specimens have a chance to be combined into one specimen, improving its fitness overall.
I think it is fairly easy to understand in terms of the TSP. Crossover helps to accumulate good sub-paths into one specimen. Of course it all depends on the choice of the crossover method.
Also, a GA's path towards the solution is not purely random. It moves in a certain direction, with stochastic mechanisms to escape traps (you can lose the best solutions if you allow it). It works because it keeps moving towards the current best solutions, but you have a population of specimens that, in a sense, share knowledge. They are all similar, but as long as you preserve diversity, new and better partial patterns can be introduced into the population and get incorporated into the best solutions. This is why diversity in the population is regarded as very important.
As a final note, please remember that the GA is a very broad topic and you can modify the basic scheme in nearly every way you want. You can introduce elitism, tabu lists, niching, etc. There is no one-and-only approach or implementation.
I need to develop a system to select a team out of a database. Is it possible to use a genetic algorithm where the initial population (chromosomes) represents players by some identifier? Each identifier has its genes in a database, which are used to apply various rules (such as the requirements to be team leader, etc.).
Is GA helpful for such scenario?
Yes, it can be.
First, evolutionary algorithms work directly on the genotype of an individual. Using identifiers to link an individual in the algorithm to a database record is either an implementation detail (irrelevant to the question) or simply a bad idea (you should load the genotypes into memory for faster access).
Your problem is a simple combination problem. For n available players from which we want to form a team of size k, there are n! / (k! ⋅ (n - k)!) possible combinations. That is generally too many possibilities to enumerate with today's computing resources. Evolutionary algorithms allow (among other things) the optimization of functions that are too large to solve exhaustively or for which no analytic solution exists.
You seem confused as to how to implement this kind of process. First, choosing a good data representation is important for getting good results. You should start by stating every characteristic you want to optimize, how each relates to performance, and whether interactions between characteristics affect overall performance.
You should be careful, though: genetic algorithms can tend to get stuck in local maxima, so be sure to keep your genetic diversity high by not punishing relatively good solutions too hard or using too steep a selection phase.
That being said, the analysis I gave you was a purely combinatorial view. From the point of view of a team, where each player's role matters, evolutionary algorithms won't be the most efficient tool. For instance, if you need 3 attackers, 2 defenders and a goalkeeper, you can simply sort your player list three times (first by the characteristics of a good attacker, then of a defender, and finally of a goalkeeper) and take the best elements, the first after sorting, to compose your team; a rough sketch of this follows below. This will be far faster than an evolutionary algorithm and will give you an optimal result. Evolutionary algorithms such as genetic algorithms would be the prime choice if you had no idea of the mechanics of the game being played, nor of what constitutes optimal play.
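The sort-and-pick approach might look like this; the player records and scoring attributes are made up purely for illustration.

```python
# Sketch of the sort-and-pick approach: score players per role, sort,
# and take the top ones. Player fields and role criteria are made up.
players = [
    {"name": "P1", "shooting": 9, "tackling": 3, "reflexes": 2},
    {"name": "P2", "shooting": 4, "tackling": 8, "reflexes": 3},
    {"name": "P3", "shooting": 7, "tackling": 5, "reflexes": 1},
    {"name": "P4", "shooting": 2, "tackling": 9, "reflexes": 4},
    {"name": "P5", "shooting": 3, "tackling": 2, "reflexes": 9},
    {"name": "P6", "shooting": 8, "tackling": 4, "reflexes": 2},
    {"name": "P7", "shooting": 1, "tackling": 7, "reflexes": 8},
]

def pick(pool, key, count):
    """Take the `count` best players from `pool` according to `key`."""
    chosen = sorted(pool, key=key, reverse=True)[:count]
    remaining = [p for p in pool if p not in chosen]
    return chosen, remaining

attackers, pool = pick(players, lambda p: p["shooting"], 3)
defenders, pool = pick(pool, lambda p: p["tackling"], 2)
goalkeeper, _ = pick(pool, lambda p: p["reflexes"], 1)
print([p["name"] for p in attackers + defenders + goalkeeper])
```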
Nevertheless, it is a good idea to start toying with genetic algorithms to get a grasp of their possibilities and limitations. A good way is to begin with a simple framework in a simple language, such as deap or pyevolve in Python, to try your ideas out.
I got the following interesting task:
You are given a list of 1 million 16-digit numbers (say, credit card numbers): 990,000 are purely random numbers generated by a computer system, and 10,000 were created manually by fraudsters. The numbers are labeled as genuine or fraud. Build an algorithm to predict the non-random numbers.
My approach so far is a bit brute-force: looking at the non-random numbers for patterns (such as repeated digits, 22222, or sequences, 01234).
I wonder if there is a ready-made algorithm or tool for this kind of task. I imagine it must be quite common in the fraud analytics community.
Thanks.
First off, if you know they're credit card numbers, use Luhn's algorithm, which is a quick checksum algorithm for valid credit card numbers.
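For reference, the Luhn check can be implemented in a few lines; this is the standard algorithm, not anything specific to your dataset, and the example number is made up.

```python
# Luhn checksum: double every second digit from the right, subtract 9
# from any result above 9, and check that the total is divisible by 10.
def luhn_valid(number: str) -> bool:
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:           # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4539578763621486"))  # made-up well-formed number -> True
```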
However, if they are simply 16-digit integers, there are a couple of approaches you can use. It is hard to tell whether an individual number came from a random source (the number 1111111111111111 is just as likely as any other output of a random number generator). As for your repeated numbers and patterns, that is very reminiscent of the concept of Kolmogorov complexity (see the links below). You could look for such patterns with your brute-force method, but I suspect it would be quite inaccurate, as humans trying to look random may actually avoid repeated digits and obvious sequences!
Instead, I suggest focusing on the way people generate numbers. You can treat human input as a very poor random number generator. So I recommend making a list of human-entered "random" numbers yourself if you don't have another dataset. Then you can use machine learning to train a classifier that distinguishes purely random numbers from those with the "human-like" attributes your algorithm has learned to recognize. As features for the statistical classifier, Kolmogorov complexity could be one, digit frequency another (see Benford's law on Wikipedia), and the number of repeating digits a third (humans might try to avoid repeating digits to look non-random, so let the classifier do the work!).
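To make that concrete, here are a few candidate features you could feed a classifier: compressed length as a crude stand-in for Kolmogorov complexity, per-digit frequency counts, and the longest run of a repeated digit. These are illustrative choices under my own assumptions, not a validated feature set.

```python
import zlib
from collections import Counter
from itertools import groupby

# Illustrative features for a classifier over 16-digit strings:
# - compressed length as a crude proxy for Kolmogorov complexity
# - digit frequency counts (a loose nod to Benford-style frequency features)
# - longest run of a repeated digit
def features(number: str) -> dict:
    return {
        "compressed_len": len(zlib.compress(number.encode())),
        "longest_run": max(len(list(g)) for _, g in groupby(number)),
        **{f"count_{d}": c for d, c in Counter(number).items()},
    }

print(features("2222201234555555"))
```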
From my personal experience, tough problems like this are a textbook case for machine learning algorithms and statistical classifiers.
Hope this helps!
Links:
Kolmogorov Complexity
Complexity calculator