mutation in both offspring in genetic algorithm - genetic-algorithm

Should mutation be applied to both offspring (at the same position and with the same probability)?
The following is the code I use for mutation in a binary population. Is it correct?
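rand2=randi(100);                                      % uniform integer in 1..100
if rand2>=0 && rand2<=2                                % rand2>=0 is always true, so this fires with probability 2/100
    mu1_point=randi(m);                                % one position in 1..m, shared by both offspring
    parent1(1,mu1_point)=abs(1-parent1(1,mu1_point));  % flip the bit in the first offspring
    parent2(1,mu1_point)=abs(1-parent2(1,mu1_point));  % flip the bit at the same position in the second
end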
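The thread itself gives no answer. As a point of comparison only, a common textbook convention is bit-flip mutation applied to each offspring independently, with every gene tested against the same small rate, instead of one shared position for both. The sketch below illustrates that convention in Python; the function name `bit_flip` and the `rate` parameter are hypothetical and not part of the code above.

```python
import random
from typing import List, Optional


def bit_flip(genome: List[int], rate: float,
             rng: Optional[random.Random] = None) -> List[int]:
    """Return a copy of genome with each bit flipped independently with probability rate."""
    if rng is None:
        rng = random.Random()
    return [1 - bit if rng.random() < rate else bit for bit in genome]


# Each offspring is mutated on its own, so the flipped positions generally differ.
rng = random.Random(0)
offspring1 = [0, 1, 1, 0, 1]
offspring2 = [1, 1, 0, 0, 0]
mutated1 = bit_flip(offspring1, rate=0.02, rng=rng)
mutated2 = bit_flip(offspring2, rate=0.02, rng=rng)
```

Whether a shared or an independent mutation position works better is problem dependent; the independent variant simply avoids coupling the two offspring.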

Related

Curve fit does not return expected result

I need a little help with my code for fitting a curve to some data.
I have the following data:
'''
x_data=[0.0, 0.006702200711821348, 0.012673613376102217, 0.01805805116486128, 0.02296065262674275, 0.027460615301376282,
0.03161908492177514, 0.03548425629114566, 0.03909479074665314, 0.06168416627459879, 0.06395092768264225,
0.0952415360565632, 0.0964823380829502, 0.11590819258911032, 0.11676250975220677, 0.18973251809768016,
0.1899603458289615, 0.2585011532435637, 0.2586068948029052, 0.40046782450999047, 0.40067753715444315]
y_data=[0.005278154532534359, 0.004670803439961002, 0.004188802888597246, 0.003796976494876385, 0.003472183813732432,
0.0031985782141146, 0.002964943046115825, 0.0027631157936632137, 0.0025870148284089897, 0.001713418196416643,
0.0016440241050665323, 0.0009291243501697267, 0.0009083385934116964, 0.0006374601714823219, 0.0006276132323039056,
0.00016900738921547616, 0.00016834735819595378, 7.829234957755694e-05, 7.828353274888779e-05, 0.00015519569743801753,
0.00015533437619227267]
'''
I know that the data can be fitted using the following mathematical model:
'''
def model(x,a,b,c):
    return (a*b)/(b*x+1)+3*c*x**2
'''
I am trying to calibrate the coefficients a, b, c of the model so that I obtain the following result (the calibrated model is in red and the data sample is in blue): (image not available)
My code, which is meant to produce the result shown in that picture, is:
'''
import numpy as np
from scipy.optimize import curve_fit
popt, _pcov = curve_fit(model, x_data, y_data,maxfev = 100000)
x_sample=np.linspace(0,0.5,1000)
y_sample=model(x_sample,*popt)
'''
If I plot the predicted data based on the fitted coefficients (in green), I get this result: (image not available)
For some reason I get coefficients that produce a result I know is wrong. Does anyone know how to solve this issue?
Your model y=(a*b)/(b*x+1)+3*c*x**2 does not appear really satisfactory. Judging by the shape of the data, an exponential term seems better than the hyperbolic term. That is why the proposed model is:
y=A * exp(B * x) + C * x**2
The method to compute approximate values of the parameters A, B, C is shown below: (image not available)
Details of the numerical calculation: (image not available)
Note:
The parabolic term appears under-represented. This is because there are not enough points at large x compared to the many points at small x.
The method used above is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales. The method is not iterative and does not need initial "guessed" values. The accuracy is not good when there are few points, because of the numerical integration (the computation of the Sk).
If necessary, this can be improved by post-processing with non-linear regression, starting from the approximate parameter values above.
An even better model is made of two exponentials: (image not available)
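The post-processing idea can be sketched in Python with the question's data. This is not the integral-equation method from the linked document: as a stand-in, the starting values for A and B come from a straight-line fit of log(y) over the points with x < 0.1 (an assumed cut-off, where the exponential term dominates), and C starts at 0. The name `exp_model` is hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

x_data = np.array([0.0, 0.006702200711821348, 0.012673613376102217, 0.01805805116486128,
                   0.02296065262674275, 0.027460615301376282, 0.03161908492177514,
                   0.03548425629114566, 0.03909479074665314, 0.06168416627459879,
                   0.06395092768264225, 0.0952415360565632, 0.0964823380829502,
                   0.11590819258911032, 0.11676250975220677, 0.18973251809768016,
                   0.1899603458289615, 0.2585011532435637, 0.2586068948029052,
                   0.40046782450999047, 0.40067753715444315])
y_data = np.array([0.005278154532534359, 0.004670803439961002, 0.004188802888597246,
                   0.003796976494876385, 0.003472183813732432, 0.0031985782141146,
                   0.002964943046115825, 0.0027631157936632137, 0.0025870148284089897,
                   0.001713418196416643, 0.0016440241050665323, 0.0009291243501697267,
                   0.0009083385934116964, 0.0006374601714823219, 0.0006276132323039056,
                   0.00016900738921547616, 0.00016834735819595378, 7.829234957755694e-05,
                   7.828353274888779e-05, 0.00015519569743801753, 0.00015533437619227267])


def exp_model(x, A, B, C):
    """Proposed model: exponential decay plus a parabolic term."""
    return A * np.exp(B * x) + C * x**2


# Rough starting point: log(y) is close to linear in x where the exponential dominates.
small = x_data < 0.1  # assumed cut-off, chosen by eye
slope, intercept = np.polyfit(x_data[small], np.log(y_data[small]), 1)
p0 = [np.exp(intercept), slope, 0.0]

# Refine all three parameters with non-linear least squares.
popt, _pcov = curve_fit(exp_model, x_data, y_data, p0=p0)
A, B, C = popt
```

Supplying `p0` is the main difference from the call in the question, which leaves `curve_fit` to start from its default initial guess of 1 for every parameter.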

How can I get my value for generations to work for my genetic algorithm (Python)

I have created a genetic algorithm that decides which shows to watch in order to maximise surplus. The code is inspired by Kie Codes and follows the code in this video: https://www.youtube.com/watch?v=nhT56blfRpE
Unfortunately, I am unable to get the generations part to work: the result always reports 0 generations, even though the solution it produces differs from one run to another.
The variables I have are name, fitness and weight, which correspond to the name of the show, its rating, and the number of episodes it has.
Because the generations part of the code does not work, it also does not account for the amount of time the program takes to create the optimal solution.
from random import choices, randint, randrange, random #allows for a random creations to be made
from typing import List, Callable, Tuple #allows to use list functions and callable
from collections import namedtuple #allows the use for named tuple
import time #allows for time to be used
from functools import partial #allows for partial to be used
Genome=List[int]
Population=List[Genome]
FitnessFunc=Callable[[Genome],int] #puts the fitness with the genome
PopulateFunc=Callable[[], Population]#gives out new solutions
SelectionFunc=Callable[[Population, FitnessFunc], Tuple[Genome,Genome]] #takes the population and fitness to make select for a new solution
CrossoverFunc=Callable[[Genome, Genome], Tuple[Genome,Genome]]#takes two genomes and produces two genomes
MutationFunc=Callable[[Genome], Genome]#takes a genome and sometimes produces a new genome
PrinterFunc=Callable[[Population, int, FitnessFunc], None]
#puts the functions into perameters
Thing=namedtuple('Thing',['name','value','weight'])#gives a structure for the list below
things=[
Thing('Seishun Buta Yarou wa Bunny Girl Senpai no Yume wo Minai (TV)',8.38,13),
Thing('Bakemonogatari',8.36,15),
Thing('Kodomo no Jikan (TV)',6.82,12),
Thing('Tsuki ga Kirei',8.18,12),
Thing('Tokyo Ravens',7.53,24),
Thing('Kono Subarashii Sekai ni Shukufuku wo!',8.15,10),
Thing('Shingeki no Kyojin S4',9.2,12),
Thing('Dr Stone S2',8.33,11),
Thing('The Promised Neverland',8.11,11),
Thing('Re:Zero',8.7,12),
Thing('Toradora!',8.24,25),
Thing('Sousei no Onmyouji',7.33,50),
Thing('Rurouni Kenshin: Meiji Kenkaku Romantan',8.31,94),
Thing('Fullmetal Alchemist: Brotherhood',9.2,64),
Thing('Steins;Gate',9.11,24),
Thing('Boku no Pico',4.31,3)
]
def generate_genome(length: int) -> Genome:
    return choices([0,1], k=length)
#creates a random genome
##genome
def generate_population(size: int, genome_length: int) -> Population:
    return [generate_genome(genome_length) for _ in range(size)]
#generates a list of genomes
##population
def fitness(genome: Genome, things: things, weight_limit: int) -> int:
    if len(genome)!=len(things):#checks if the two are the same length
        raise ValueError("genome and things must be of the same length")
    weight=0
    value=0
    for i, thing in enumerate(things):
        if genome[i]==1:
            weight+=thing.weight
            value+=thing.value #adds the weight values and fitness values
            if weight>weight_limit: #checks if the genome is under the weight limit
                return 0
    return value
#fitness function
def selection_pair(population: Population, fitness_func: FitnessFunc) -> Population:
    return choices(
        population=population,
        weights=[fitness_func(genome) for genome in population],
        k=2 #draw twice (two parents)
    )
#selection function
def single_point_crossover(a: Genome,b: Genome) -> Tuple[Genome,Genome]:
    if len(a)!=len(b):
        raise ValueError("Genomes a and b must be of same length")
    #checks the size of the genomes that was selected
    #the genomes have to be the same in length or it will not work
    length=len(a)
    if length<2:#checks the length as they would have to be at least 2 or else you are unable to cut them in half
        return a,b
    p=randint(1, length-1)#randomly takes parts of the two genomes
    return a[0:p]+b[p:],b[0:p]+a[p:]#and then creates them
#crossover function
def mutation(genome: Genome, num: int=1, probability: float=0.5) -> Genome:
    for _ in range(num):
        index=randrange(len(genome))#if the number made is higher than the probability it is left alone
        genome[index]=genome[index] if random() > probability else abs(genome[index]-1)#makes the absolute value so that it is either 1 or 0
        #people hate this one but why ^^
    return genome
#mutation function
def run_evolution(
        populate_func: PopulateFunc,
        fitness_func: FitnessFunc,#for these two it populates with variables
        fitness_limit: int,#if the fitness limit is reached then the program is done
        selection_func: SelectionFunc=selection_pair,
        crossover_func: CrossoverFunc=single_point_crossover,
        mutation_func: MutationFunc=mutation,# initialises the above three
        generation_limit: int=100 #limits on how much is made
) -> Tuple[Population, int]:
    population=populate_func()#populates with random genomes
    for generations in range(generation_limit):
        population=sorted(
            population,
            key=lambda genome: fitness_func(genome),
            reverse=True
            #sorts the fitness of the genomes so that the higher number is at the start
        )
        if fitness_func(population[0])>=fitness_limit:
            break
        #if the fitness limit is reached then the program ends (breaks)
        next_generation=population[0:2]#gets the top two genomes for the next generation
        for j in range(int(len(population)/2)-1):
            parents=selection_func(population, fitness_func) #draws two parents at random, weighted by fitness
            offspring_a, offspring_b=crossover_func(parents[0], parents[1]) #puts the genomes into separate variables
            offspring_a=mutation_func(offspring_a)
            offspring_b=mutation_func(offspring_b)#creates a mutation for both genomes randomly
            next_generation+=[offspring_a, offspring_b]
            #creates the next genomes and puts them into population
        population=next_generation
    population=sorted(
        population,
        key=lambda genome: fitness_func(genome),
        reverse=True
        #sorts the fitness of the genomes so that the higher number is at the start
    )
    return population, generations
#evolutionary main loop
def genome_to_things(genome: Genome, things: things) -> things:
    result=[]
    for i, thing in enumerate(things):
        if genome[i]==1:
            result+=[thing.name]
    return result
#collects the best solution so that it can be printed out later
start=time.time()
population, generations=run_evolution(
    populate_func=partial(
        generate_population, size=10, genome_length=len(things)
    ),
    fitness_func=partial(
        fitness, things=things, weight_limit=94
    ),
    fitness_limit=10,
    generation_limit=100
)
end=time.time()
#runs the program
print(f"number of generations: {generations}")
print(f"time: {end-start}s")
print(f"best solution: {genome_to_things(population[0], things)}")
#prints results
It means your GA finds a solution that satisfies your fitness limit already in the initial generation (generation 0), so the loop breaks on its first iteration and the loop variable is still 0.
If I run your code with fitness_limit=100, I get the following output:
number of generations: 99
time: 0.014637947082519531s
best solution: ['Seishun Buta Yarou wa Bunny Girl Senpai no Yume wo Minai (TV)', 'Bakemonogatari', 'Tsuki ga Kirei', 'Kono Subarashii Sekai ni Shukufuku wo!', 'Shingeki no Kyojin S4', 'Dr Stone S2', 'Re:Zero', 'Boku no Pico']
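The effect can be isolated in a few lines of Python. This is only a sketch of the loop-and-break pattern, not the full algorithm: the function `generations_until` and the list of best-fitness values are hypothetical stand-ins for what each generation would produce.

```python
from typing import List


def generations_until(fitness_limit: float, best_per_generation: List[float]) -> int:
    """Return the loop index at which the fitness limit is first met (or the last index)."""
    generations = 0
    for generations, best in enumerate(best_per_generation):
        if best >= fitness_limit:
            break  # same early exit as in run_evolution
    return generations


# Hypothetical best fitness found in generations 0, 1 and 2.
history = [16.7, 20.1, 25.0]

low_limit = generations_until(10, history)    # limit already met by generation 0
high_limit = generations_until(100, history)  # limit never met, loop runs to the end
```

With a limit of 10 the very first generation already qualifies, which matches the reported 0; with a limit that is never reached, the index of the last generation is returned, just as 99 is reported for generation_limit=100.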

Misclassification cost matrix on the Random Forest classifier implemented by Apache Spark (MLLIB)

I saw that the Random Forest classifier implemented by Apache Spark's MLLIB does not support a misclassification cost matrix for training and metric evaluation.
QUESTIONS: Is there a way to implement this without changing the MLLIB's code? How?
Each split on each tree should consider the cost matrix (e.g., C5.0 cost matrix related question).
Below is part of the Scala code for training and prediction using the [new] Random Forest classifier provided by MLLIB.
[...]
// Train a RandomForest model.
// Here, I would like to insert a misclassification cost matrix
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("rawFeatures")
  .setNumTrees(500)
  .setImpurity("gini")
  .setMaxBins(500)
  .setMaxDepth(30)
  .setFeatureSubsetStrategy("50")
// Chain indexers and forest in a Pipeline
val pipeline = new Pipeline()
  .setStages(Array(assembler,rf))
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
[...]
// I could use predictions and the cost matrix to evaluate
// some metrics directly.
[...]
Thank you in advance.
regards,
Paulo Angelo
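The thread has no answer. The last comment in the code, evaluating with predictions and a cost matrix directly, can at least be sketched outside the library. The Python sketch below does not use any MLLIB API and does not change how the trees are split; it assumes the predicted class probabilities and true labels have already been collected into NumPy arrays. The names `min_cost_decisions` and `total_cost`, and the example numbers, are hypothetical.

```python
import numpy as np


def min_cost_decisions(probabilities: np.ndarray, cost: np.ndarray) -> np.ndarray:
    """Pick, for each row, the class with the lowest expected misclassification cost.

    cost[i, j] is the cost of predicting class j when the true class is i.
    """
    expected = probabilities @ cost  # expected[n, j] = sum_i p[n, i] * cost[i, j]
    return expected.argmin(axis=1)


def total_cost(true_labels: np.ndarray, predicted: np.ndarray, cost: np.ndarray) -> float:
    """Sum the cost-matrix entries selected by (true, predicted) pairs."""
    return float(cost[true_labels, predicted].sum())


# Hypothetical two-class example: missing class 1 is five times as costly as a false alarm.
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])
probabilities = np.array([[0.8, 0.2],
                          [0.9, 0.1]])
decisions = min_cost_decisions(probabilities, cost)
cost_if_all_true_zero = total_cost(np.array([0, 0]), decisions, cost)
```

Note how the first row is assigned class 1 even though class 0 is more probable, because the expected cost of predicting 0 (0.2 * 5) exceeds that of predicting 1 (0.8 * 1).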

Algorithm to find which hashes in a list match another hash the fastest? (this is complicated to explain on title)

Explaining in words when two hashes match is complicated, so here is an example.
Hash patterns are stored in a list like this (I'm using JavaScript for notation):
pattern:[
0:{type:'circle', radius:function(n){ return n>10; }},
1:{type:'circle', radius:function(n){ return n==2; }},
2:{type:'circle', color:'blue', radius:5},
... etc]
var test = {type:'circle', radius:12};
test should match pattern 0 because pattern[0].type==test.type && pattern[0].radius(test.radius)==true.
So, in words: a hash matches a pattern if each of its values is either equal to the corresponding value of the pattern or makes the pattern's function return true when passed to it.
My question is: is there an algorithm to find all patterns that match a given hash without testing all of them?
Consider a dynamic, recursive, decision tree structure like the following.
decision:[
    field:'type',
    values:[
        'circle': <another decision structure>,
        'square': 0, // meaning it matched, return this value
        'triangle': <another decision structure>
    ],
    functions:[
        function(n){ return n<12;}: <another decision structure>,
        function(n){ return n>12;}: <another decision structure>
    ],
    missing: <another decision structure>
]
Algorithm on d (a decision structure):
if test has field d.field
    if test[d.field] in d.values
        if d.values[test[d.field]] is a decision structure
            recurse using the new decision structure
        else
            yield d.values[test[d.field]]
    foreach f => v in d.functions
        if f(test[d.field])
            if v is a decision structure
                recurse using the new decision structure
            else
                yield v
else if d.missing is present
    if d.missing is a decision structure
        recurse using the new decision structure
    else
        yield d.missing
else
    No match
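The pseudocode above can be sketched as runnable Python. The class `Decision`, the function `match` and the attribute `field_name` are hypothetical names chosen for this illustration, and the tree is built by hand for the three patterns from the question; building it automatically from a pattern list is not covered. The sketch assumes that field values are hashable and that `None` is never used as a pattern id.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator, List, Tuple


@dataclass
class Decision:
    field_name: str
    values: Dict[Any, Any] = field(default_factory=dict)  # exact value -> outcome
    functions: List[Tuple[Callable[[Any], bool], Any]] = field(default_factory=list)
    missing: Any = None                                    # outcome when the field is absent


def match(d: Decision, test: Dict[str, Any]) -> Iterator[Any]:
    """Yield every pattern id reachable from d for the given test hash."""
    def follow(outcome: Any) -> Iterator[Any]:
        if isinstance(outcome, Decision):
            yield from match(outcome, test)  # recurse into the nested structure
        else:
            yield outcome                    # a leaf: the id of a matched pattern

    if d.field_name in test:
        value = test[d.field_name]
        if value in d.values:
            yield from follow(d.values[value])
        for predicate, outcome in d.functions:
            if predicate(value):
                yield from follow(outcome)
    elif d.missing is not None:
        yield from follow(d.missing)


# Hand-built tree for patterns 0, 1 and 2 from the question.
tree = Decision("type", values={
    "circle": Decision(
        "radius",
        values={5: Decision("color", values={"blue": 2})},
        functions=[(lambda n: n > 10, 0), (lambda n: n == 2, 1)],
    ),
})

matches = list(match(tree, {"type": "circle", "radius": 12}))
```

Exact values are resolved with a dictionary lookup, so only the function-valued entries for the current field still have to be tested one by one.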

how to identify the minimal set of parameters describing a data set

I have a bunch of regression test data. Each test is just a list of messages (associative arrays), mapping message field names to values. There's a lot of repetition within this data.
For example
test1 = [
{ sender => 'client', msg => '123', arg => '900', foo => 'bar', ... },
{ sender => 'server', msg => '456', arg => '800', foo => 'bar', ... },
{ sender => 'client', msg => '789', arg => '900', foo => 'bar', ... },
]
I would like to represent the field data (as a minimal-depth decision tree?) so that each message can be programmatically regenerated using a minimal number of parameters. For example, in the above
foo is always 'bar', so I don't need to mention it
sender and arg are correlated, so I only need to mention one or the other
and msg is different each time
So I would like to be able to regenerate these messages with a program along the lines of
write_msg( 'client', '123' )
write_msg( 'server', '456' )
write_msg( 'client', '789' )
where the write_msg function would be composed of nested if statements or subfunction calls using the parameters.
Based on my original data, how can I determine the 'most important' set of parameters, i.e. the ones that will let me recreate my data set using the smallest number of arguments?
The following papers describe algorithms for discovering functional dependencies:
Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100–111, 1999, doi:10.1093/comjnl/42.2.100.
I. Savnik and P. A. Flach. Bottom-up induction of functional dependencies from relations. In Proc. AAAI-93 Workshop: Knowledge Discovery in Databases, pages 174–185, Washington, DC, USA, 1993.
C. Wyss, C. Giannella, and E. Robertson. FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances. In Proc. Data Warehousing and Knowledge Discovery, pages 101–110, Munich, Germany, 2001, doi:10.1007/3-540-44801-2.
Hong Yao and Howard J. Hamilton. "Mining functional dependencies from data." Data Mining and Knowledge Discovery, 2008, doi:10.1007/s10618-007-0083-9.
There has also been some work on discovering multivalued dependencies:
I. Savnik and P. A. Flach. "Discovery of Multivalued Dependencies from Relations." Intelligent Data Analysis Journal, 4(3):195–211, IOS Press, 2000.
This looks very similar to Database Normalization.
You have a relation (your test data set) and some known functional dependencies: {sender} => arg, {} => foo and possibly {msg} => sender. If the order of the tests is important, add {testNr} => msg. You want to eliminate the redundancies.
Treat your test set as a database table, apply the normalization rules and create equivalent functions (getArgFromSender(sender) etc.) for each join.
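Whether a candidate functional dependency actually holds in the data can be checked mechanically. The Python sketch below uses the three messages from the question (without the elided fields); the function name `determines` is hypothetical, and a dependency that holds in a small sample is of course not guaranteed to hold in general.

```python
from typing import Any, Dict, List, Sequence

rows = [
    {"sender": "client", "msg": "123", "arg": "900", "foo": "bar"},
    {"sender": "server", "msg": "456", "arg": "800", "foo": "bar"},
    {"sender": "client", "msg": "789", "arg": "900", "foo": "bar"},
]


def determines(rows: List[Dict[str, Any]], lhs: Sequence[str], rhs: str) -> bool:
    """Return True if the fields in lhs functionally determine rhs within rows."""
    seen: Dict[tuple, Any] = {}
    for row in rows:
        key = tuple(row[name] for name in lhs)
        # The first value seen for a key is remembered; any later mismatch breaks the dependency.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


sender_gives_arg = determines(rows, ["sender"], "arg")  # {sender} => arg
foo_is_constant = determines(rows, [], "foo")           # {} => foo
sender_gives_msg = determines(rows, ["sender"], "msg")  # does not hold: client sends 123 and 789
```

Each dependency that holds corresponds to one lookup function of the kind mentioned above, such as getArgFromSender(sender).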
If the number of fields and records is small:
Brute force it by looping through every combination of fields, and for each combination detect if there are multiple items in the list which map to the same value.
If you can live with a fairly good choice of fields:
Start off assuming you need all fields. Then select a field at random and see if it can be eliminated; if it can, cross it off the set of fields. Otherwise, choose another field at random and try again. When no more fields can be eliminated, you have found a reasonable set of fields. Had you chosen other fields first, you might have found a better solution. You can repeat the whole procedure a few times and pick the best solution if you like. This kind of approach is called hill climbing.
(I suspect that this problem is NP-complete, i.e. we probably don't know of an efficient and powerful solution, so it is not worth losing sleep over trying to dream up a perfect one.)
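The brute-force variant can be sketched in Python as follows. It tries field combinations from smallest to largest and accepts a combination when no two distinct messages share the same values for it, so the remaining fields can be looked up from it. The function name `minimal_parameter_sets` is hypothetical, the data is the question's example without the elided fields, and the search is exponential in the number of fields.

```python
from itertools import combinations
from typing import Any, Dict, List, Sequence, Tuple

rows = [
    {"sender": "client", "msg": "123", "arg": "900", "foo": "bar"},
    {"sender": "server", "msg": "456", "arg": "800", "foo": "bar"},
    {"sender": "client", "msg": "789", "arg": "900", "foo": "bar"},
]


def minimal_parameter_sets(rows: List[Dict[str, Any]],
                           fields: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return all smallest field combinations that identify every distinct row."""
    distinct_rows = {tuple(row[name] for name in fields) for row in rows}
    for size in range(len(fields) + 1):
        found = []
        for combo in combinations(fields, size):
            projections = {tuple(row[name] for name in combo) for row in rows}
            # No collisions: each projection corresponds to exactly one distinct row.
            if len(projections) == len(distinct_rows):
                found.append(combo)
        if found:
            return found
    return []


best = minimal_parameter_sets(rows, ["sender", "msg", "arg", "foo"])
```

On this tiny sample msg alone is enough, because every message has a different msg; with more data, where msg values repeat across senders, larger combinations would be returned.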

Resources