
Evolving String using Genetic Algorithm (Part 1 - The Experiment)

Date: 25 May 2022

#post #python

This post originally appeared on Blog 2.0


There is this thing called a genetic algorithm: a technique that simulates natural selection to guide a computer program towards a certain behaviour. Let’s jump in and experiment a little bit. It would be a great Python exercise anyway. This project is heavily inspired by Daniel Shiffman’s book The Nature of Code.

Ummm it’s just a fun project, so I guess it’s going to just be some casual writing.

First, let’s import some stuff

import random
from tqdm import tqdm
import pandas as pd
import itertools

Alright, let’s think about what should be included in this project.

A simple start is to try evolving a string of text. The characters are like nucleotides. We manually define a fitness function that assigns reproductive fitness to each individual in the population. Each generation, we somehow pull parents from a mating pool, cross over their genetic information in some way, apply some mutation, and move on.

So… here’s a quick brainstorm of what the experiment might look like:

  • population size = 1000, etc.
  • mutation rate = 0.01, etc.
  • options for target
    • a long one "one general law, leading to the advancement of all organic beings, namely, multiply, vary, let the strongest live and the weakest die."
    • or a shorter and easier one? "multiply, vary, let the strongest live and the weakest die"
  • The character set
    • maybe list(" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!?%&'(),-./:;")
    • or list(" abcdefghijklmnopqrstuvwxyz,.")
  • simulating fitness
  • cross over strategy
    • random crossing
    • midpoint crossing

Anyhow, let’s just say that we are only interested in the influence of population size, mutation rate, and crossover strategy. We’ll keep everything else constant.

Before writing any code, I’ll take my turn at making some hypotheses. Well, these are just guesses; don’t take them too seriously.

  • If the mutation rate is too high, evolution might fail to converge, as random changes to the genotype overwrite most inherited information. If the mutation rate is too low, there will be little variation in the gene pool, making it hard for the population’s genetic makeup to shift.
  • A large population size allows more diverse genes in the population and thus a higher likelihood that a gene that gives individuals survival/reproductive advantage will be present in the population and get selected for. On the other hand, if the population size is too small, there’s less genetic variation and it would take more generations to reach higher fitness.
  • Midpoint crossing allows higher genetic stability, whereas random crossing introduces more variation?

The algorithm for each generation will roughly look like this:

  • Declare constants
    • The target configuration is modelled by a sentence (e.g. “one general law, leading to the advancement of all organic beings, namely, multiply, vary, let the strongest live and the weakest die.”)
    • The set of genetic materials is " abcdefghijklmnopqrstuvwxyz,."
  • Generate a population with random genes
  • Selection
    • Calculate the fitness of each individual in the population; a fitness function could look like: fitness = number of matching letters / length of the gene segment
    • Assign higher mating probability to individuals with higher fitness and store the information in a mating pool
    • Sample parents from the mating pool
  • Reproduction
    • Cross over the genes of two parents
      • if midpoint crossing
        • Generate a random midpoint
        • Cut out the segment before the midpoint from parent 1’s gene
        • Cut out the segment after the midpoint from parent 2’s gene
        • combine the two segments
      • if random crossing
        • for each position, take the character from parent 1 or parent 2 with equal probability
  • Mutation: each letter in the child’s gene has a probability of changing to another letter

This algorithm is repeated over many generations, and the evolved genomes are checked against the target.
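
Just to make the fitness function concrete, here’s a tiny made-up example (the toy target and genome below are purely for illustration):

# toy illustration of the fitness idea -- not part of the actual experiment
target = "vary"
genome = "varx"
matches = sum(1 for g, t in zip(genome, target) if g == t)
fitness = matches / len(target)
print(matches, fitness)  # 3 matching letters out of 4 -> fitness 0.75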

Since randomness is part of the procedure (to simulate random mutations), the evolutionary outcome of each trial will be different. Because the experiment is run on a computer, it’s cheap to perform multiple trials for each set of parameters and get more precise data.

Alright, we define two classes: DNA, which holds the information and functions a genome would have, and Experiment. Since we are running experiments with different independent variables, it probably helps to create an instance for each experiment so that everything is self-contained.

class DNA():
    def __init__(self, target, char_set):
        self.target = target
        self.genome = ''.join([random.choice(char_set) for i in range(len(target))])
        self.char_set = char_set
        
    def count_matches(self):
        assert len(self.genome) == len(self.target)
        count = 0
        for i in range(len(self.genome)):
            if self.genome[i] == self.target[i]:
                count += 1
        return count    
        
    def calc_fitness(self):
        self.fitness = self.count_matches() / len(self.target)
        # self.fitness = self.count_matches() ** 2
        
    def crossover(self, partner, strategy = "midpoint"):
        
        # MIDPOINT CROSS
        if strategy == "midpoint":
            midpoint = random.randint(0, len(self.genome))
            child_genome = self.genome[:midpoint] + partner.genome[midpoint:]
    
        # RANDOM CROSS
        elif strategy == "random":
            child_genome = ''
            for i in range(len(self.genome)):
                if random.random() < 0.5:
                    child_genome += self.genome[i]
                else:
                    child_genome += partner.genome[i]
        
        # Anything else
        else:
            raise ValueError('Invalid strategy')
                        
        child = DNA(self.target, self.char_set)
        child.genome = child_genome
        return child
    
    def mutate(self, mutation_rate):
        # each character has a `mutation_rate` chance of being replaced by a random one
        self.genome = ''.join([random.choice(self.char_set) if random.random() < mutation_rate else c for c in self.genome])
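
Just to sanity-check DNA before wiring everything together, a quick throwaway snippet like this could work (the target and mutation rate here are made up on the spot, and the printed genomes will be different on every run):

# hypothetical quick check of the DNA class -- not part of the experiment itself
parent1 = DNA("vary", list(" abcdefghijklmnopqrstuvwxyz,."))
parent2 = DNA("vary", list(" abcdefghijklmnopqrstuvwxyz,."))
parent1.calc_fitness()
print(parent1.genome, parent1.fitness)   # random 4-letter genome, most likely fitness 0.0
child = parent1.crossover(parent2, strategy="midpoint")
child.mutate(0.01)
print(child.genome)

And the Experiment class:
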
class Experiment():
    def __init__(self, target, population_size, mutation_rate, char_set, cross_strat):
        self.target = target
        self.population_size = population_size
        self.mutation_rate = mutation_rate
        self.char_set = char_set
        self.cross_strat = cross_strat
        self.generation = 0
        self.population = [DNA(self.target, self.char_set) for i in range(self.population_size)]
        self.num_target_hits = 0
    
    def evolve(self, num_generations):
        
        avg_fitnesses = []
        num_target_hits = []
        generation_num = []
                
        pbar = tqdm(range(num_generations), leave=False, mininterval=1)
        for i in pbar:
            # selection
            mating_pool = []
            for individual in self.population:
                individual.calc_fitness()
                # fitness-proportional selection: ~fitness * 100 copies go into the mating pool
                n = (individual.fitness) * 100
                # n = (individual.fitness)
                mating_pool += [individual for j in range(int(n))]
                
            # reproduce
            self.population = []
            for j in range(self.population_size):
                parent1 = random.choice(mating_pool)
                parent2 = random.choice(mating_pool)
                child = parent1.crossover(parent2, strategy = self.cross_strat)
                child.mutate(self.mutation_rate)
                if child.genome == self.target:
                    self.num_target_hits += 1
                self.population += [child]
                
            # increase generation count
            self.generation += 1
            
            # output every 10 percent
            # if num_generations % int(num_generations / 10) == 0:
            #     pbar.set_description_str(child.genome)
            
            # construct stats
            avg_fitnesses.append(self.avg_fitness())
            num_target_hits.append(self.num_target_hits)
            generation_num.append(self.generation)
        
        # return stats
        return avg_fitnesses, num_target_hits, generation_num
        
    def avg_fitness(self):
        # average fitness over the whole population
        avg_fitness = 0
        for individual in self.population:
            individual.calc_fitness()
            avg_fitness += individual.fitness
            # print(individual.genome, individual.fitness)
        avg_fitness /= len(self.population)
        return avg_fitness
    
    def print_samples(self, num_samples):
        for i in range(num_samples):
            sample = random.choice(self.population)
            sample.calc_fitness()
            print(f'"{sample.genome}", fitness = {sample.fitness}')
    
    def print_summary(self, num_samples):
        print(f'AVG FITNESS = {self.avg_fitness()}')
        print(f'NUM HITS: {self.num_target_hits}')
        print()
        self.print_samples(num_samples)

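If you want to eyeball a single run before launching the full grid, a quick hypothetical smoke test could look like this (the small target, population size, and generation count are just picked so it finishes fast):

# hypothetical smoke test -- tiny run, not one of the real experiments
e = Experiment(
    target = "vary, multiply",
    population_size = 200,
    mutation_rate = 0.01,
    char_set = list(" abcdefghijklmnopqrstuvwxyz,."),
    cross_strat = "midpoint",
)
e.evolve(100)
e.print_summary(3)
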
Alright, now it’s time to come up with the specific independent variables we will test and put them in a dictionary. There are many different combinations, so I’ll just let Python generate them automatically with itertools.product.

independents = {
    'population_size': [50, 100, 150, 200, 500, 1000],
    'mutation_rate': [0.0005, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1],
    'cross_strat': ['midpoint', 'random'],
    'target': ["multiply, vary, let the strongest live and the weakest die"],
    'char_set': [list(" abcdefghijklmnopqrstuvwxyz,.")]
}

And here we define the number of trials for each experiment and the number of generations we let each one evolve for:

num_trials = 5
max_iters = 1000

Now let’s generate all the combinations, grab a sample, and see what we get. This looks like a reasonable parameter set for an experiment:

keys, values = zip(*independents.items())
experiments = [dict(zip(keys, v)) for v in itertools.product(*values)]
experiments[0]
{'population_size': 50,
 'mutation_rate': 0.0005,
 'cross_strat': 'midpoint',
 'target': 'multiply, vary, let the strongest live and the weakest die',
 'char_set': [' ',
  'a',
  'b',
  'c',
  'd',
  'e',
  'f',
  'g',
  'h',
  'i',
  'j',
  'k',
  'l',
  'm',
  'n',
  'o',
  'p',
  'q',
  'r',
  's',
  't',
  'u',
  'v',
  'w',
  'x',
  'y',
  'z',
  ',',
  '.']}
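
Quick check on how many parameter combinations that gives us (6 population sizes × 7 mutation rates × 2 crossover strategies):

len(experiments)
84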

A dataframe for storing the results:

results = pd.DataFrame({
    'experiment_id': [],
    'trial_num': [],
    'population_size': [],
    'mutation_rate': [],
    'cross_strat': [],
    'target': [],
    'char_set': [],
    'generation_num': [],
    'avg_fitnesses': [], 
    'num_target_hits': [],
})

And the loop to run each experiment! We should be done after this.

# experimental loop

for exp_id, experiment in tqdm(enumerate(experiments), total=len(experiments)):
    
    for trial_i in range(num_trials):
            
        e = Experiment(
            population_size = experiment['population_size'],
            mutation_rate = experiment['mutation_rate'],
            cross_strat = experiment['cross_strat'],
            target = experiment['target'],
            char_set = experiment['char_set'],
        )

        avg_fitnesses, num_target_hits, generation_num = e.evolve(max_iters)

        result = pd.DataFrame({
            'generation_num': generation_num,
            'avg_fitnesses': avg_fitnesses, 
            'num_target_hits': num_target_hits,
        })

        result['experiment_id'] = exp_id
        result['trial_num'] = trial_i
        result['population_size'] = experiment['population_size']
        result['mutation_rate'] = experiment['mutation_rate']
        result['cross_strat'] = experiment['cross_strat']
        result['target'] = experiment['target']
        result['char_set'] = ''.join(experiment['char_set'])

        results = pd.concat([results, result])
100%|██████████| 84/84 [1:53:36<00:00, 81.15s/it]

Whoa, that took two hours to run. I guess I expected it to take less than half an hour, but here we have the results. Let’s take a look and save them somewhere.

results

     experiment_id  trial_num  population_size  mutation_rate cross_strat                                              target                      char_set  generation_num  avg_fitnesses  num_target_hits
0              0.0        0.0             50.0         0.0005    midpoint  multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.             1.0       0.052069              0.0
1              0.0        0.0             50.0         0.0005    midpoint  multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.             2.0       0.062069              0.0
2              0.0        0.0             50.0         0.0005    midpoint  multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.             3.0       0.064828              0.0
3              0.0        0.0             50.0         0.0005    midpoint  multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.             4.0       0.077241              0.0
4              0.0        0.0             50.0         0.0005    midpoint  multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.             5.0       0.080690              0.0
...            ...        ...              ...            ...         ...                                                ...                           ...             ...            ...              ...
995           83.0        4.0           1000.0         0.1000      random  multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.           996.0       0.156500              0.0
996           83.0        4.0           1000.0         0.1000      random  multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.           997.0       0.157448              0.0
997           83.0        4.0           1000.0         0.1000      random  multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.           998.0       0.157207              0.0
998           83.0        4.0           1000.0         0.1000      random  multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.           999.0       0.160914              0.0
999           83.0        4.0           1000.0         0.1000      random  multiply, vary, let the strongest live and the...  abcdefghijklmnopqrstuvwxyz,.          1000.0       0.158828              0.0

420000 rows × 10 columns

results.to_csv('results.csv')
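
In Part 2 the plan is simply to load this file back with pandas, presumably something like:

results = pd.read_csv('results.csv', index_col=0)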

Cool. Onward to Part 2: Analysis!