Creating Crossword Minis with Genetic Algorithms

Introduction

The art of making crossword puzzles is a creative domain with a culture all of its own. For cruciverbalists (those who solve crosswords), a good crossword puzzle depends heavily on novelty and value in its thematic elements (many crossword puzzles have a theme), but also on what is called the fill. The fill refers to all of the non-themed answers in the puzzle that "fill" in the rest of the grid. For themeless puzzles (often regarded as the more challenging puzzles to solve and create), the fill is everything. It is an art that requires finding words that are well-known enough (by some measure of "enough"), that fit together, that are exciting, and that can be clued (i.e., for which a clue can be written) in possibly creative ways that become puzzles of their own.

As a testament to the art of crossword making, consider that the New York Times (known for having the best crossword puzzles in the world, largely thanks to the well-known crossword editor Will Shortz) will pay between $500 and $2,500 per puzzle depending on the size, difficulty, and experience of the artist in creating NYT crosswords. Having submitted puzzles to the NYT and heard stories from others who have submitted, I can say that there is an intricate and elegant art that goes into creating a successful puzzle, most of which is well beyond the scope of this class or this project. What appears at first glance to be merely an optimization problem is actually a much more sophisticated and at times subjective art.

In this project you will design a genetic algorithm in Python for building themeless, mini crossword puzzles. Here is an example of a mini crossword puzzle from the New York Times:

New York Times Mini Crossword Puzzle from January 3, 2017

The purpose of this project is primarily to give you a chance to try out a genetic algorithm, but in the process you will see an example of a computational creative system and be asked to assess some of the creative aspects of the system. For purposes of simplification, we will assume a mini crossword puzzle with dimensions ranging from 1x1 to 5x5 with no black squares (all letters). We will also assume that for our purposes, a good crossword puzzle is simply a puzzle for which all of the words across and down are valid English words (as per the WordNet database).

Dependencies

The code was developed with Python 3.9.1 and depends on prior installation of the nltk library. Once it is installed, you need to call nltk.download('wordnet') to install the resources needed for this project (you only need to do this once).

Requirements

The primary goal of this project is for you to learn how to use genetic algorithms as a conceptualization in a computational creativity system. We provide a Python framework from which to start. A puzzle is represented as a 2D array of uppercase letters (A-Z). The 2D array is square, with the length of a side defined by the global variable DIMENSION. Based on this representation, several helper functions are implemented for you. To complete this project you must implement the following function:

run_ga(): a function that runs a genetic algorithm and outputs the final population. Note that nothing in the definition of this function, with the possible exception of crossword-specific print function calls, should refer specifically to the domain (crosswords). Domain-specificity should be handled in functions called from within this function. Inputs:

gens: max number of generations to run
population_size: number of individual genomes to preserve from one generation to the next
children_per_generation: number of children to generate from the population during each generation
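
The loop described above can be sketched as follows. This is only one possible shape for run_ga(), not the reference solution: the four callables (init_individual, fitness, crossover, mutate) and the mutation_rate and rng keyword arguments are hypothetical additions used here to keep the loop free of crossword-specific code, and uniform random parent selection with keep-the-fittest survival is just one simple choice among many. The bit-string demo at the bottom is a toy problem, not a crossword.

```python
import random


def run_ga(gens, population_size, children_per_generation,
           init_individual, fitness, crossover, mutate,
           mutation_rate=0.5, rng=random):
    """Domain-agnostic GA loop (illustrative sketch under the assumptions above)."""
    population = [init_individual() for _ in range(population_size)]
    for _ in range(gens):
        children = []
        for _ in range(children_per_generation):
            # Selection: two parents drawn uniformly at random (an assumption).
            parent_a = rng.choice(population)
            parent_b = rng.choice(population)
            child = crossover(parent_a, parent_b)
            children.append(mutate(child, mutation_rate))
        # Survival: keep the fittest individuals among parents and children.
        # Note that fitness is recomputed for every individual on each sort.
        ranked = sorted(population + children, key=fitness, reverse=True)
        population = ranked[:population_size]
    return population


# Toy demo: evolve 8-bit lists toward all ones (fitness = number of ones).
demo_rng = random.Random(0)
final_population = run_ga(
    gens=30, population_size=5, children_per_generation=5,
    init_individual=lambda: [demo_rng.randint(0, 1) for _ in range(8)],
    fitness=sum,
    crossover=lambda a, b: a[:4] + b[4:],
    mutate=lambda bits, rate: [1 - x if demo_rng.random() < rate / 8 else x
                               for x in bits],
    rng=demo_rng,
)
print("best toy fitness:", sum(final_population[0]))
```

Passing the domain functions in as arguments is one way to satisfy the requirement that run_ga() itself stay domain-agnostic; calling module-level functions with fixed names would work equally well.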

To implement run_ga() successfully, you will also need to implement functions for population initialization, crossover, mutation, and fitness. You should calculate the fitness of a puzzle as the number of unique words (i.e., a word that appears twice only counts once) across and down that are valid as per WordNet. Note that an initialize_crossword() function is implemented for you, which creates a new, square 2D array of random uppercase letters. The format of the array returned by this function is compatible with the other helper functions implemented for you, which include:

print_crossword: a function that prints the input crossword (represented as a 2D array of uppercase letters) to the console.
print_crossword_clues: a function that uses WordNet to look up words that appear across and down in the input crossword puzzle and prints their definitions to the console (invalid words are indicated as such).

You may find the print_crossword_clues() function implementation helpful to understand how to use the WordNet library elsewhere in your implementation if you want to check whether a word is valid. You may modify any of the helper code that you wish.
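
As a minimal sketch of the fitness rule stated above (count each distinct valid across or down word once), the following assumes a hypothetical is_valid callable and a tiny stand-in word set so that it runs without WordNet. In the project, is_valid would instead wrap a WordNet lookup, for example by testing whether nltk's wordnet.synsets() returns anything for the lowercased word. The function names here are illustrative and are not part of the provided framework.

```python
def across_and_down_words(grid):
    """Return the row strings followed by the column strings of a square grid."""
    rows = ["".join(row) for row in grid]
    cols = ["".join(col) for col in zip(*grid)]
    return rows + cols


def fitness(grid, is_valid):
    """Count the distinct across/down words that is_valid accepts."""
    unique_words = set(across_and_down_words(grid))
    return sum(1 for word in unique_words if is_valid(word))


# Hypothetical stand-in dictionary; the real project would query WordNet.
TOY_WORDS = {"ON", "NO"}

grid = [["O", "N"],
        ["N", "O"]]

# ON and NO each appear twice (across and down) but count once each.
print(fitness(grid, lambda word: word in TOY_WORDS))  # 2
```

Taking is_valid as a parameter keeps the counting logic easy to test on its own and makes it simple to swap in a different or cached dictionary lookup later.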

Deliverables

Once you are satisfied that your implementation is working correctly, you should write a report according to the following requirements. Compress your report and your code into a single ZIP file and upload to Moodle. The report should be written in LaTeX. Here is an Overleaf template (go to "Menu">"Copy Project") to help in writing the report. The report should follow the standard format for a scientific paper, with sections as indicated below (you don't need to do an Intro or Related Works for this project, but I include them as a reminder that they are part of the standard format). For each section include a subsection (per requirement) that briefly but thoroughly responds to each of the requirements below. Unless specified otherwise, assume 10000 generations, a population size of 20, 20 children per generation, a mutation rate of 0.5, and a puzzle with dimensions 4x4.

Introduction: you do not need to do an introduction for this report.

Related Works: you do not need to do a related works section for this report.

Methods: Describe each of the following in a separate subsection

Selection criteria: how do you choose which parents to “breed” when creating a new child?
Crossover method: once parents are chosen, how do you crossover the two parent genomes to create a new child genome? Justify your choice of method. Given the time and resources, can you think of a different way to perform a crossover that might help to keep “good” pieces of parent crosswords more intact during crossover? Why might this be worth considering?
Mutate method: once the child is created, how do you mutate the child? (a question only a computer scientist can appreciate) Did you “flip a coin” at each grid position? At each row? Justify your choice and comment on the pros and cons of these two specific approaches.
Fitness function: how do you evaluate the fitness of a particular solution? Based on how you evaluate fitness, is there an optimal score? If so, what would an optimal score be? Does your algorithm quit once an optimal solution is found?
Survival method: once you generate children, how do you choose who survives to the next generation?
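
To make the crossover and mutation questions above concrete, here is a hedged sketch of one row-wise crossover and of the two mutation styles being compared (a coin flip at each grid position versus a coin flip at each row). The function names and the rng parameter are hypothetical, and these are only examples of the design space, not recommended or required choices.

```python
import random
import string


def crossover_rows(parent_a, parent_b, rng=random):
    """Build a child by copying each whole row from one parent or the other."""
    return [list(rng.choice((row_a, row_b)))
            for row_a, row_b in zip(parent_a, parent_b)]


def mutate_cells(grid, mutation_rate, rng=random):
    """Flip a coin at every grid position; on success, draw a new random letter."""
    return [[rng.choice(string.ascii_uppercase)
             if rng.random() < mutation_rate else letter
             for letter in row]
            for row in grid]


def mutate_rows(grid, mutation_rate, rng=random):
    """Flip a coin once per row; on success, redraw one random letter in that row."""
    child = [list(row) for row in grid]  # copy so the parent is left untouched
    for row in child:
        if rng.random() < mutation_rate:
            row[rng.randrange(len(row))] = rng.choice(string.ascii_uppercase)
    return child


demo_rng = random.Random(42)
parent_a = [list("CAT"), list("ARE"), list("TEN")]
parent_b = [list("DOG"), list("OWE"), list("GEL")]
child = crossover_rows(parent_a, parent_b, demo_rng)
print(mutate_cells(child, 0.05, demo_rng))
print(mutate_rows(child, 0.5, demo_rng))
```

Copying whole rows keeps every across word of the chosen parent intact but can break down words; with the same mutation rate, the per-cell version changes more letters on average than the per-row version, so the two rates are not directly comparable.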

Results: Include each of the following in a separate subsection

Show 3 graphs for 3 different runs with generation on the x-axis and best fitness score so far on the y-axis. Compare similarities and differences between the 3 graphs. With parameters staying the same, why aren’t these graphs all identical? In many machine learning algorithms, the algorithm is built to stop training after accuracy values cease to increase for a few epochs. Based on what you see in your graphs, do you think a similar stopping criterion is appropriate for a genetic algorithm (i.e., is it a good idea to end the genetic algorithm after fitness scores plateau for a few generations)? Why or why not?
Run your algorithm using population sizes of 1, 10, and 100. Show graphs for each run with generation on the x-axis and best fitness score so far on the y-axis. You should run the algorithm a few times for each population size and report results that represent an average run for the given population size. In general, as population size increases, what do you notice about the number of generations required to evolve good solutions? Based on your understanding of the algorithm, explain why you think this is the case. What do you notice about the runtime required per generation? Explain why this would be the trend.
Run your algorithm using mutation rates of 0.05, 0.25, and 0.75. Show graphs for each run with generation on the x-axis and best fitness score so far on the y-axis. You should run the algorithm a few times for each mutation rate and report results that represent an average run for the given mutation rate. In general, as mutation rate increases, what do you notice about the number of generations required to evolve good solutions? Based on your understanding of the algorithm, explain why you think this is the case.
Run your algorithm with puzzle dimensions of 1x1, 2x2, 3x3, 4x4, and 5x5. As the dimension increases, what do you notice about the number of generations required to evolve good solutions? Explain this trend in terms of the algorithm’s inner workings.
Show the best puzzle of size 5x5 you are able to obtain (we'll show these in class). Show the fitness score and the clues for the puzzle. Include the parameters used (i.e., mutation rate, population size, # of children generated at each round) to generate the puzzle.
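
The graphs above all plot "best fitness score so far" per generation, and several ask for an average over a few runs. A minimal sketch of computing both series is shown here, assuming you record the best fitness in the population at each generation; the function names and the sample numbers are hypothetical and are not results from any actual run.

```python
from itertools import accumulate


def best_so_far(per_generation_best):
    """Running maximum: the best fitness seen up to and including each generation."""
    return list(accumulate(per_generation_best, max))


def average_runs(runs):
    """Average several equal-length fitness curves generation by generation."""
    return [sum(values) / len(values) for values in zip(*runs)]


# Made-up per-generation best scores for two short runs (illustration only).
run_1 = [1, 3, 2, 5, 4]
run_2 = [2, 2, 4, 4, 6]

curve_1 = best_so_far(run_1)   # [1, 3, 3, 5, 5]
curve_2 = best_so_far(run_2)   # [2, 2, 4, 4, 6]
print(average_runs([curve_1, curve_2]))  # [1.5, 2.5, 3.5, 4.5, 5.5]
```

The resulting lists can be passed directly to whatever plotting tool you prefer, with the list index as the generation number on the x-axis.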

Discussion

When developing this project, I tried a different NLTK corpus that took significantly (i.e., ~100x) longer per word to check that a word was valid. Describe what impact a slower fitness function might have on the effectiveness of a genetic algorithm.
Finding words that fit together can be challenging, and crossword creators often fall back on starting their search with words that regularly trade off vowels and consonants. Such words can be easier to put together. Consider a fitness function that, in addition to rewarding valid words in the across and down directions, also rewards words that, though invalid, have proper spacing of vowels and consonants. What impact would this more nuanced fitness function have on helping to find better solutions faster? (Hint: consider how a fitness function that only rewards valid words scores a quasi-valid solution versus a completely garbage solution.)
A puzzle with good fill is about more than just finding valid words. Besides simply scoring based on words being valid, what other factors do you think might be important in evaluating how “good” a crossword is? (i.e., what is a “good” crossword, and how do you measure it?)
This is an example of a basic CC system. Based on the inner workings of the system, comment on what aspect of the system represents (or could be added to represent) each of the following components of a CC system (as discussed by Ventura): representation (genotype and phenotype), knowledge base, aesthetic, conceptualization, generation, genotypic evaluator, translation, phenotypic evaluator. If you feel this is a domain in which it is unnecessary to work with both phenotypic and genotypic representations, justify your answer.
How novel, valuable, and intentional do you think the artifacts of this system are? Justify your answer in terms of how we have discussed these characteristics in class (i.e., what does it take to be considered novel in this context? What does it take to be valuable? etc.) Would you say this system is being creative?
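
The "more nuanced" fitness function raised in the discussion of vowel and consonant spacing could look something like the sketch below. Everything here is an assumption made for illustration: the alternation measure, the partial_weight of 0.5, the is_valid callable, and the choice to treat only A, E, I, O, and U as vowels are not part of the handout.

```python
VOWELS = set("AEIOU")  # simplification: Y is treated as a consonant


def alternation_score(word):
    """Fraction of adjacent letter pairs that switch between vowel and consonant."""
    if len(word) < 2:
        return 0.0
    switches = sum((a in VOWELS) != (b in VOWELS) for a, b in zip(word, word[1:]))
    return switches / (len(word) - 1)


def nuanced_fitness(grid, is_valid, partial_weight=0.5):
    """Score 1 per distinct valid word, plus partial credit for invalid words
    whose letters alternate between vowels and consonants."""
    rows = ["".join(row) for row in grid]
    cols = ["".join(col) for col in zip(*grid)]
    score = 0.0
    for word in set(rows + cols):
        if is_valid(word):
            score += 1.0
        else:
            score += partial_weight * alternation_score(word)
    return score


print(alternation_score("TACO"))  # 1.0: every adjacent pair switches
print(alternation_score("XQZK"))  # 0.0: all consonants
```

Keeping partial_weight below 1 means an invalid word can never score as much as a valid one, while still letting the algorithm distinguish a nearly word-like grid from random letters.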

Conclusion: Briefly (i.e., 1-2 sentences) summarize what you accomplished in this project. Also briefly mention a few of the most significant things you feel you learned/took away from this project.

Appendix: include a URL to your code if available