zettelkasten


2022-05-29 Evolving String using Genetic Algorithm (Part 2 - Visualisations & Analysis)

Date: 29 May 2022

#post #python

This post originally appeared on Blog 2.0


This is a continuation of Part 1 of Evolving String using Genetic Algorithm. It’s time to fire up pandas and seaborn and pyplot again!

H2 Setting up

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()

Load the data

results = pd.read_csv('results.csv')

Table 1: Raw Data

A total of 84 experiments were conducted, each with 5 trials, and each trial recorded 1000 data points. The data table therefore has 420,000 rows, so only its head and tail are shown here.
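As a quick sanity check on those counts, here is a minimal sketch that rebuilds the hyperparameter grid. The three value lists are read off the pivot tables later in this post; the constant names are made up for illustration, and the nesting order is only an assumption that happens to agree with the `experiment_id` values visible in the tables.

```python
from itertools import product

# Values read off the pivot tables later in this post.
population_sizes = [50, 100, 150, 200, 500, 1000]
mutation_rates = [0.0005, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1]
cross_strats = ["midpoint", "random"]

# Hypothetical constant names; the counts themselves come from the text above.
TRIALS_PER_EXPERIMENT = 5
GENERATIONS_PER_TRIAL = 1000

# One experiment per combination of population size, mutation rate and strategy.
experiments = list(product(population_sizes, mutation_rates, cross_strats))
total_rows = len(experiments) * TRIALS_PER_EXPERIMENT * GENERATIONS_PER_TRIAL

print(len(experiments))  # number of hyperparameter combinations
print(total_rows)        # expected number of rows in results.csv
```

With this ordering, index 0 is (50, 0.0005, midpoint) and index 83 is (1000, 0.1, random), which lines up with the first and last rows of the raw data.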

results

        Unnamed: 0  experiment_id  trial_num  population_size  mutation_rate  cross_strat  target                                             char_set                      generation_num  avg_fitnesses  num_target_hits
0       0           0.0            0.0        50.0             0.0005         midpoint     multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  1.0             0.052069       0.0
1       1           0.0            0.0        50.0             0.0005         midpoint     multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  2.0             0.062069       0.0
2       2           0.0            0.0        50.0             0.0005         midpoint     multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  3.0             0.064828       0.0
3       3           0.0            0.0        50.0             0.0005         midpoint     multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  4.0             0.077241       0.0
4       4           0.0            0.0        50.0             0.0005         midpoint     multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  5.0             0.080690       0.0
...     ...         ...            ...        ...              ...            ...          ...                                                ...                           ...             ...            ...
419995  995         83.0           4.0        1000.0           0.1000         random       multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  996.0           0.156500       0.0
419996  996         83.0           4.0        1000.0           0.1000         random       multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  997.0           0.157448       0.0
419997  997         83.0           4.0        1000.0           0.1000         random       multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  998.0           0.157207       0.0
419998  998         83.0           4.0        1000.0           0.1000         random       multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  999.0           0.160914       0.0
419999  999         83.0           4.0        1000.0           0.1000         random       multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  1000.0          0.158828       0.0

420000 rows × 11 columns

H2 Downscaling dataset

There are simply way too many data points to deal with; we could have recorded data less frequently. To thin things out, we will keep only every 10th generation.

And… Oops, we didn’t record stats for generation 0! What a blunder. Since the experiment took two hours to run, we’ll just include generation 1 as the first data point.

# keep every 10th generation, plus generation 1 as the starting point
# .copy() makes `processed` an independent DataFrame, so adding a column below does not raise SettingWithCopyWarning
processed = results[(results['generation_num'] % 10 == 0) | (results['generation_num'] == 1)].copy()
# one label per hyperparameter combination, used for colouring the plots
processed['hyper'] = 'P: ' + processed["population_size"].astype(str) + '; R: ' + processed["mutation_rate"].astype(str) + '; S: ' + processed["cross_strat"].astype(str)
processed

        Unnamed: 0  experiment_id  trial_num  population_size  mutation_rate  cross_strat  target                                             char_set                      generation_num  avg_fitnesses  num_target_hits  hyper
0       0           0.0            0.0        50.0             0.0005         midpoint     multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  1.0             0.052069       0.0              P: 50.0; R: 0.0005; S: midpoint
9       9           0.0            0.0        50.0             0.0005         midpoint     multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  10.0            0.107241       0.0              P: 50.0; R: 0.0005; S: midpoint
19      19          0.0            0.0        50.0             0.0005         midpoint     multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  20.0            0.143793       0.0              P: 50.0; R: 0.0005; S: midpoint
29      29          0.0            0.0        50.0             0.0005         midpoint     multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  30.0            0.188966       0.0              P: 50.0; R: 0.0005; S: midpoint
39      39          0.0            0.0        50.0             0.0005         midpoint     multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  40.0            0.207586       0.0              P: 50.0; R: 0.0005; S: midpoint
...     ...         ...            ...        ...              ...            ...          ...                                                ...                           ...             ...            ...              ...
419959  959         83.0           4.0        1000.0           0.1000         random       multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  960.0           0.154362       0.0              P: 1000.0; R: 0.1; S: random
419969  969         83.0           4.0        1000.0           0.1000         random       multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  970.0           0.156328       0.0              P: 1000.0; R: 0.1; S: random
419979  979         83.0           4.0        1000.0           0.1000         random       multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  980.0           0.161483       0.0              P: 1000.0; R: 0.1; S: random
419989  989         83.0           4.0        1000.0           0.1000         random       multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  990.0           0.160724       0.0              P: 1000.0; R: 0.1; S: random
419999  999         83.0           4.0        1000.0           0.1000         random       multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.  1000.0          0.158828       0.0              P: 1000.0; R: 0.1; S: random

42420 rows × 12 columns
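The row count can be checked by hand: each trial keeps generations 10, 20, ..., 1000 plus generation 1, so 101 rows, and 84 × 5 × 101 = 42,420. Below is a small sketch that applies the same filter to a made-up miniature dataset; the `toy` frame (2 experiments instead of 84) is purely hypothetical and only mimics the relevant columns of `results.csv`.

```python
import pandas as pd

# Hypothetical miniature of results.csv: 2 experiments x 5 trials x 1000 generations.
rows = [
    {"experiment_id": e, "trial_num": t, "generation_num": g}
    for e in range(2)
    for t in range(5)
    for g in range(1, 1001)
]
toy = pd.DataFrame(rows)

# Same filter as in the post: every 10th generation, plus generation 1.
kept = toy[(toy["generation_num"] % 10 == 0) | (toy["generation_num"] == 1)]

# Number of rows that survive the filter in each individual trial.
per_trial = kept.groupby(["experiment_id", "trial_num"]).size()

print(per_trial.unique())  # rows kept per trial
print(len(kept))           # total rows kept in the toy dataset
```

Scaling the per-trial count up to the full 84 experiments gives the 42,420 rows reported above.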

H2 Overview visualisation

Right now, let’s just mindlessly plot everything and see what happens.

Graph 1: Average Fitness over Generations

The average fitness for each trial for each experiment is plotted here. Looks like it’s a mess—we can’t really read much out of this graph alone. At least we know that evolution slowed down over time and there are fluctuations.

plt.figure(figsize=(50, 30))
ax = sns.lineplot(x="generation_num", y="avg_fitnesses", data=processed, palette="tab20", hue="hyper", style="trial_num")

ax.set_title('average fitness over generations')
ax.set_xlabel('generation')
ax.set_ylabel('average fitness')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

plt.show()

optimage - genetic alg analyse_12_0.jpg

Graph 2: Number of Perfect Matches over Generations (log scale)

Out of curiosity, we recorded the cumulative number of individuals that have a genome that’s exactly the same as the target genome. We’re beginning to see something interesting here. All trials of the brown and pink experiments reached relatively high numbers, three trials of the yellow experiments took off, and every other experiment kind of lay at the bottom.

plt.figure(figsize=(50, 30))
ax = sns.lineplot(x="generation_num", y="num_target_hits", data=processed, palette="tab20", hue="hyper", style="trial_num")

ax.set_title('total number of optimal genomes over generations (log scale)')
ax.set_xlabel('generation')
ax.set_ylabel('cumulative number of optimal genomes')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

plt.yscale('log')
plt.show()

optimage - genetic alg analyse_14_0.jpg

H2 Matrix

Well, just so that we have a less overwhelming visualisation, maybe it helps to have more summary statistics. We’ll begin by only taking the final generation and taking means over 5 trials for each experiment. Then we’ll make pivot tables of variables that we care about.

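# keep only the final generation, then average the 5 trials of each experiment
final_gen_summary = processed[processed['generation_num'] == 1000]
# numeric_only=True skips the text columns (target, char_set, hyper), which cannot be averaged
final_gen_summary = final_gen_summary.groupby(by=['experiment_id', 'cross_strat', 'population_size', 'mutation_rate'], as_index=False).mean(numeric_only=True)
final_gen_summary = final_gen_summary.drop(columns=['Unnamed: 0', 'trial_num'])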

final_gen_summary

     experiment_id  cross_strat  population_size  mutation_rate  generation_num  avg_fitnesses  num_target_hits
0    0.0            midpoint     50.0             0.0005         1000.0          0.330966       0.0
1    1.0            random       50.0             0.0005         1000.0          0.402483       0.0
2    2.0            midpoint     50.0             0.0010         1000.0          0.343793       0.0
3    3.0            random       50.0             0.0010         1000.0          0.422000       0.0
4    4.0            midpoint     50.0             0.0050         1000.0          0.421586       0.0
...  ...            ...          ...              ...            ...             ...            ...
79   79.0           random       1000.0           0.0200         1000.0          0.472955       0.0
80   80.0           midpoint     1000.0           0.0500         1000.0          0.242928       0.0
81   81.0           random       1000.0           0.0500         1000.0          0.263414       0.0
82   82.0           midpoint     1000.0           0.1000         1000.0          0.147431       0.0
83   83.0           random       1000.0           0.1000         1000.0          0.158603       0.0

84 rows × 7 columns

Table 2: Summary Fitness Statistics

A pivot table showing the end of experiment summary statistics of average fitness.

final_gen_fitnesses = final_gen_summary.pivot(index='population_size', columns=['cross_strat', 'mutation_rate'], values='avg_fitnesses')
final_gen_fitnesses

cross_strat      midpoint                                                              random
mutation_rate    0.0005    0.0010    0.0050    0.0100    0.0200    0.0500    0.1000    0.0005    0.0010    0.0050    0.0100    0.0200    0.0500    0.1000
population_size
50.0             0.330966  0.343793  0.421586  0.340897  0.282000  0.170138  0.125793  0.402483  0.422000  0.448966  0.348690  0.317586  0.212759  0.127586
100.0            0.444000  0.464034  0.530793  0.453379  0.325759  0.203034  0.132138  0.536931  0.599379  0.553414  0.490828  0.380966  0.236897  0.152690
150.0            0.522506  0.566736  0.610299  0.505126  0.360437  0.222575  0.140069  0.683655  0.713839  0.650368  0.543655  0.411839  0.229034  0.151379
200.0            0.600155  0.657052  0.632069  0.536759  0.368845  0.218948  0.138431  0.771879  0.756466  0.701224  0.571741  0.429517  0.257397  0.155759
500.0            0.828972  0.820034  0.755407  0.603034  0.423917  0.240172  0.150117  0.958400  0.903028  0.775931  0.619572  0.458821  0.263228  0.154641
1000.0           0.933990  0.918238  0.774448  0.621641  0.448486  0.242928  0.147431  0.974152  0.947997  0.782272  0.630507  0.472955  0.263414  0.158603

Table 3: Summary Hit Count Statistics

Another pivot table, showing the mean number of individuals reaching the target genome.

final_gen_hits = final_gen_summary.pivot(index='population_size', columns=['cross_strat', 'mutation_rate'], values='num_target_hits')
final_gen_hits

cross_strat      midpoint                                                              random
mutation_rate    0.0005    0.0010    0.0050    0.0100    0.0200    0.0500    0.1000    0.0005    0.0010    0.0050    0.0100    0.0200    0.0500    0.1000
population_size
50.0             0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0
100.0            0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0
150.0            0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0
200.0            0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0
500.0            0.0       0.0       0.0       0.0       0.0       0.0       0.0       38407.6   1.6       0.0       0.0       0.0       0.0       0.0
1000.0           0.0       1.8       0.0       0.0       0.0       0.0       0.0       161351.8  28208.4   0.0       0.0       0.0       0.0       0.0

Alright then, time for some heatmaps. These should help reveal the combined impact of mutation rate and population size.

# split each pivot table by crossover strategy (each slice is still a DataFrame, not a numpy array)
fitness_random_matrix = final_gen_fitnesses['random']
hits_random_matrix = final_gen_hits['random']

fitness_midpoint_matrix = final_gen_fitnesses['midpoint']
hits_midpoint_matrix = final_gen_hits['midpoint']

Graph 3: average fitness at generation 1000 for random crossing

fig = plt.figure(figsize=(8, 6))
matrix = fitness_random_matrix
ax = sns.heatmap(matrix, annot=True, square = True, cmap='cividis', xticklabels=list(matrix.keys()), yticklabels=list(matrix.index))
ax.set_title("average fitness at generation 1000 for random crossing")
ax.set_xlabel("mutation rate")
ax.set_ylabel("population size")
plt.show()

optimage - genetic alg analyse_25_0.jpg

Graph 4: cumulative hits at generation 1000 for random crossing

fig = plt.figure(figsize=(8, 6))
matrix = hits_random_matrix
ax = sns.heatmap(matrix, annot=True, square = True, cmap='cividis', xticklabels=list(matrix.keys()), yticklabels=list(matrix.index))
ax.set_title("cumulative hits at generation 1000 for random crossing")
ax.set_xlabel("mutation rate")
ax.set_ylabel("population size")
plt.show()

optimage - genetic alg analyse_27_0.jpg

Graph 5: average fitness at generation 1000 for midpoint crossing

fig = plt.figure(figsize=(8, 6))
matrix = fitness_midpoint_matrix
ax = sns.heatmap(matrix, annot=True, square = True, cmap='cividis', xticklabels=list(matrix.keys()), yticklabels=list(matrix.index))
ax.set_title("average fitness at generation 1000 for midpoint crossing")
ax.set_xlabel("mutation rate")
ax.set_ylabel("population size")
plt.show()

optimage - genetic alg analyse_29_0.jpg

Graph 6: cumulative hits at generation 1000 for midpoint crossing

fig = plt.figure(figsize=(8, 6))
matrix = hits_midpoint_matrix
ax = sns.heatmap(matrix, annot=True, square = True, cmap='cividis', xticklabels=list(matrix.keys()), yticklabels=list(matrix.index))
ax.set_title("cumulative hits at generation 1000 for midpoint crossing")
ax.set_xlabel("mutation rate")
ax.set_ylabel("population size")
plt.show()

optimage - genetic alg analyse_31_0.jpg

H2 Observations

Graph 1 plots the average fitness of the population over the evolutionary process in intervals of 10 generations. Looking at the graph, we can observe that the range of parameters tested gave a wide range of results. In general, the population’s average fitness increases quickly at the start and levels off over time. However, with different population sizes and mutation rates, the final fitness settles at different values, ranging from about 0.1 to 0.97. This graph is only an overview of all the experiments and does not clearly show how fitness relates to our variables of interest.

Graph 2 shows the cumulative number of individuals whose genome exactly matched the target genome. All trials of the brown and pink experiments reached relatively high numbers, three trials of the yellow experiments took off, and all other experiments stayed near the bottom. This indicates that only a few combinations of parameters led to an evolution in which the target genome thrived.

Graphs 3-6 show the effect of population size and mutation rate on the simulated evolution. Comparing the brightness of random crossing (Graphs 3 and 4) with midpoint crossing (Graphs 5 and 6), one can conclude that random crossing was slightly more effective than midpoint crossing. The relationships between population size and average fitness, and between mutation rate and average fitness, are more prominent and follow a similar trend across the two crossing strategies. Namely, a larger population size leads to higher average fitness, and a lower mutation rate generally does too, although for populations of 200 or fewer Table 2 shows fitness often peaking at a rate of 0.001 or 0.005 rather than at the lowest rate tested.
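The heatmap comparison can also be done numerically. The sketch below hard-codes the Table 2 means and checks two things: in how many cells random crossing comes out ahead of midpoint crossing, and which mutation rate gives the highest mean fitness for each population size. The function and variable names are invented for illustration, and since every value is a mean over only 5 trials, this is a descriptive check rather than a significance test.

```python
# Mean final-generation fitness from Table 2, keyed by population size.
# Each list follows the mutation-rate order in MUTATION_RATES.
MUTATION_RATES = [0.0005, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1]

MIDPOINT = {
    50:   [0.330966, 0.343793, 0.421586, 0.340897, 0.282000, 0.170138, 0.125793],
    100:  [0.444000, 0.464034, 0.530793, 0.453379, 0.325759, 0.203034, 0.132138],
    150:  [0.522506, 0.566736, 0.610299, 0.505126, 0.360437, 0.222575, 0.140069],
    200:  [0.600155, 0.657052, 0.632069, 0.536759, 0.368845, 0.218948, 0.138431],
    500:  [0.828972, 0.820034, 0.755407, 0.603034, 0.423917, 0.240172, 0.150117],
    1000: [0.933990, 0.918238, 0.774448, 0.621641, 0.448486, 0.242928, 0.147431],
}

RANDOM = {
    50:   [0.402483, 0.422000, 0.448966, 0.348690, 0.317586, 0.212759, 0.127586],
    100:  [0.536931, 0.599379, 0.553414, 0.490828, 0.380966, 0.236897, 0.152690],
    150:  [0.683655, 0.713839, 0.650368, 0.543655, 0.411839, 0.229034, 0.151379],
    200:  [0.771879, 0.756466, 0.701224, 0.571741, 0.429517, 0.257397, 0.155759],
    500:  [0.958400, 0.903028, 0.775931, 0.619572, 0.458821, 0.263228, 0.154641],
    1000: [0.974152, 0.947997, 0.782272, 0.630507, 0.472955, 0.263414, 0.158603],
}


def count_random_wins(midpoint, random_):
    """Count the (population size, mutation rate) cells where random crossing scored higher."""
    return sum(
        r > m
        for pop in midpoint
        for m, r in zip(midpoint[pop], random_[pop])
    )


def best_rate_per_population(table):
    """For each population size, return the mutation rate with the highest mean fitness."""
    return {
        pop: MUTATION_RATES[values.index(max(values))]
        for pop, values in table.items()
    }


total_cells = len(MIDPOINT) * len(MUTATION_RATES)
random_wins = count_random_wins(MIDPOINT, RANDOM)

print(f"random crossing is ahead in {random_wins} of {total_cells} cells")
print("best rate, midpoint:", best_rate_per_population(MIDPOINT))
print("best rate, random:  ", best_rate_per_population(RANDOM))
```

On these numbers random crossing is ahead in every one of the 42 cells, though often by a small margin.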

H2 Discussions

A likely explanation for the positive correlation between population size and average fitness is that a higher population size allows more genetic variation, just like in the real world. In this case, it is more likely for certain genes to show up in the population.

Seeing that there was a negative correlation between average fitness and mutation rate, one might suspect that a mutation rate that’s too high disrupts the genes and makes inheriting genetic information less effective at shifting the population towards a single target genome. However, this might not be representative of the factors that drive evolution in nature and should only be taken as an example of a very specific case of evolution.
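A back-of-the-envelope sketch makes this intuition more concrete. It assumes that each character of a genome mutates independently with probability equal to the mutation rate, which is an assumption about the Part 1 implementation rather than something stated here. The genome length of 50 is a hypothetical stand-in, and `prob_unmutated` is a name made up for this example.

```python
def prob_unmutated(mutation_rate: float, genome_length: int) -> float:
    """Chance that a whole genome is copied with no mutation at all.

    Assumes each character mutates independently with probability `mutation_rate`.
    """
    return (1.0 - mutation_rate) ** genome_length


GENOME_LENGTH = 50  # hypothetical; not the actual length of the target string

# The seven mutation rates tested in the experiments.
for rate in [0.0005, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1]:
    p = prob_unmutated(rate, GENOME_LENGTH)
    print(f"rate={rate:<6} P(no mutation)={p:.3f}")
```

Under these assumptions, almost every copy survives intact at the lowest rate, while at a rate of 0.1 fewer than one copy in a hundred does, so even a perfect genome would rarely be passed on unchanged.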

As for the two crossover strategies tested in this experiment, it is unclear from the data how to explain the difference in their effect on average fitness. In real-life crossovers in meiosis, the process is much more involved, as multiple chromosomes and multiple gene segments are exchanged. As a result, neither random crossing nor midpoint crossing accurately models real-life scenarios, which arguably have characteristics of both strategies.

H2 Limitations

This experiment significantly simplifies evolution as seen in biology. It shortens the genome, directly selects genotype rather than phenotype, and defines a single target genotype rather than allowing a range of individuals to survive. Due to time limitations, the experiment was done this way to simplify calculations. Further experiments could use a more complex model that takes into account the factors ignored here, but these are beyond the scope of this experiment.

As seen in Graph 1, some experiments stayed stuck at a very low fitness. In an actual ecosystem, such a poorly adapted population might well shrink significantly.

Judging by the heatmaps, larger population sizes and smaller mutation rates are more likely to lead to higher fitness. Because the heatmaps are most intense at the very edge of the tested range (the largest populations and the lowest mutation rates), the pattern cannot reasonably be extrapolated beyond that range. The optimal mutation rate for this task might well lie below the smallest tested value of 0.0005, and further experimentation would be needed to understand it better.