Markov models are probabilistic sequence models that can be used for generating sequences according to patterns learned from other (training) sequences. A trained model can also be used to compute the probability of another sequence. In this project you'll get a chance to try both: generating and classifying sequences of words using Markov models trained on fictional literature sources.
In the project, you'll work with two datasets (i.e., works) from each of the 4 pictured authors (plus one dataset from Dr. Seuss, just for good measure). Can a model trained on one dataset from an author be used to identify another dataset from the same author? How could such a model be useful for computational creativity?
The code was developed with Python 3.9.
The primary goal of this project is for you to learn how to use Markov models as a conceptualization model. We provide a Python framework and some datasets from which to start. The framework is all set to generate results (make sure that the data directory is in the same location as the main.py script). To get meaningful results, you are left to implement the Markov model class in the main.py file. This class requires (at a minimum) that you implement three class functions:
generate_sequence()
- a function that probabilistically generates a sequence of words based on the trained model parameters and returns it as a string.
compute_log_probability(test_sentences)
- a function that uses the trained model parameters to compute the probability of the input text (a list of sentences, where a sentence is a list of words).
Empty class data structures are created for you in the constructor, which should give you some idea of how to store and reference the parameters as you implement these functions. As a way to save space, one common strategy when training probabilistic models is to first use a data structure to tally up counts of a particular occurrence (in this case a particular tuple, or a transition between two particular tuples) and then, once all tallies are complete, normalize these counts to form a probability distribution (remember that a distribution sums to 1.0). A tuple (both in Python and in CS generally) is an ordered set of elements. We refer to tuples instead of words because a higher-order Markov model (i.e., a model where the Markov order is >1) can be created by essentially representing each token as a tuple of words instead of a single word. In this representation, a transition between two tokens means that the two tuples have overlapping words. For example, if I train on the phrase "I like computer science" and I'm using a Markov order of 2, one way to do this is to consider the phrase as a sequence of tuples, each of length 2: (I, like), (like, computer), (computer, science). Note that when generating a sequence, you will only add the last word in each sampled tuple, except in the case of the starting tuple.
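As a rough sketch of this tally-then-normalize strategy (not the required implementation; the helper names train_counts and normalize are made up for illustration), an order-2 model could be built along these lines:

from collections import defaultdict

def train_counts(sentences, order=2):
    """Tally transitions between overlapping word tuples (illustrative helper)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # Break each sentence into overlapping tuples of length `order`, e.g.
        # ["I", "like", "computer", "science"] -> (I, like), (like, computer), (computer, science)
        grams = [tuple(sentence[i:i + order]) for i in range(len(sentence) - order + 1)]
        for prev, curr in zip(grams, grams[1:]):
            counts[prev][curr] += 1
    return counts

def normalize(counts):
    """Convert raw tallies into probability distributions that each sum to 1.0."""
    distributions = {}
    for prev, following in counts.items():
        total = sum(following.values())
        distributions[prev] = {curr: c / total for curr, c in following.items()}
    return distributions

With the example phrase above, train_counts([["I", "like", "computer", "science"]]) tallies one transition from (I, like) to (like, computer) and one from (like, computer) to (computer, science).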
You are encouraged to add other class functions as needed in keeping with principles of abstraction and encapsulation.
As you work with dictionaries, note that you cannot use a mutable object (e.g., a list) as a key for a dictionary. Note that whereas lists are mutable, tuples in Python are not.
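For instance, a tuple can be used directly as a dictionary key, while a list cannot:

counts = {}
counts[("I", "like")] = 3      # fine: tuples are hashable
# counts[["I", "like"]] = 3    # would raise TypeError: unhashable type: 'list'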
Implemented for you are several helper functions:
sample(distribution)
- a function to pick a token t from a distribution (i.e., a dictionary representing a distribution) with probability p(t), where p(t) is the value associated with the key t in distribution.
parse_file(filename)
- a function that parses a text dataset and converts it into a (cleaned) list of sentences (where each sentence is a list of words).
compute_vocab_size()
- a function that counts all of the unique words (not tuples) in the training and test datasets. Because you'll only be running experiments that compute probabilities for OOV tokens using a Markov order of 1, this count will suffice as a representation of the size of your vocabulary.
Of these, you only ever need to call the sample(distribution) function, as the other two are called in the main() function, which is implemented for you.
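For example, a call to sample might look like the following (the distribution here is invented purely for illustration):

# A dictionary representing a distribution over possible next tokens.
distribution = {"computer": 0.75, "dogs": 0.25}
next_token = sample(distribution)  # returns "computer" with probability 0.75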
Your model should also be trained on end-of-sequence tokens. One easy way to do this is to represent an end-of-sequence token as None. When generating sequences, your model should sample a next token until it samples the end-of-sequence token, at which point it returns the sampled sequence.
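A minimal sketch of that generation loop for a Markov order of 1, assuming the trained parameters are stored in hypothetical attributes self.start_distribution and self.transitions (your names and structures may differ):

def generate_sequence(self):
    """Sample words until the end-of-sequence token (None) is drawn."""
    words = []
    token = sample(self.start_distribution)      # hypothetical: distribution over starting tokens
    while token is not None:
        words.append(token)                      # at order 1, each token is a single word
        token = sample(self.transitions[token])  # hypothetical: distribution over next tokens
    return " ".join(words)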
It is common when computing probabilities for new sequences (in particular, new sequences of words) to encounter tuples that were not seen during training. Since the model never saw such a tuple in training, it doesn't have a probability for it. How do we deal with that? One solution is to say that the probability of that particular tuple is zero according to the model. This sets the probability of any sequence with unseen tuples to zero (since probabilities are multiplied). A better solution is to give some small probability to unseen tuples. How much probability should we give? This is an area of ongoing research, but one simple solution is to give out-of-vocabulary (OOV) tokens or token transitions a probability of 1/vocab_size, where vocab_size is the number of words in your vocabulary. What is "your vocabulary"? This is also a non-trivial design decision, but for the purposes of this lab, our vocabulary will be the set of all words seen in any of the training or test datasets. This value is computed for you as the first step in the main function.
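One way this fallback might look in code (the method name transition_probability and the attribute self.transitions are assumptions, not part of the provided framework):

def transition_probability(self, prev, curr, vocab_size):
    """Return p(curr | prev), falling back to 1/vocab_size for unseen tokens or transitions."""
    if prev in self.transitions and curr in self.transitions[prev]:
        return self.transitions[prev][curr]
    return 1.0 / vocab_size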
One final note on probabilities: when you multiply probabilities over and over, the numbers become so small that eventually you get underflow, where there aren't enough bits to maintain precision. In short, if you multiply probabilities when computing probabilities of sequences, you'll almost always get 0.0 as the final probability. How to deal with this? Instead of multiplying probabilities, we operate in the space of log probabilities. It's actually quite simple. To get a log probability, you take the log of the probability. Then, wherever you would normally have multiplied probabilities, you add the log probabilities. Log probabilities are always negative numbers, but it's still the case that larger (i.e., less negative) numbers mean higher probability. We will only ever compare probabilities between different models, so feel free to compare log probabilities and not worry about bringing them back out of log space.
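In code, the switch to log space amounts to summing math.log values instead of multiplying raw probabilities. A simplified sketch (ignoring start and end-of-sequence handling, and reusing the hypothetical transition_probability helper and a hypothetical self.vocab_size attribute):

import math

def compute_log_probability(self, test_sentences):
    """Sum log probabilities over every transition in every sentence."""
    log_prob = 0.0
    for sentence in test_sentences:
        for prev, curr in zip(sentence, sentence[1:]):
            log_prob += math.log(self.transition_probability(prev, curr, self.vocab_size))
    return log_prob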
Once you are satisfied that your implementation is working correctly, you should write a report according to the following requirements. Compress your report and your code into a single ZIP file and upload it to Moodle. The report should be written in LaTeX. Here is an Overleaf template (go to "Menu" > "Copy Project") to help in writing the report. The report should follow the standard format for a scientific paper, with sections as indicated below (you don't need to write an Intro or Related Works section for this project, but I include them as a reminder that they are part of the standard format). For each section, include a subsection (per requirement) that briefly but thoroughly responds to each of the requirements below. Unless specified otherwise, assume a Markov order of 1 and the default training and test datasets.
Run the experiment in which each trained model computes the probability of each test dataset (which is what the main function is set up to do). Include in your report a table where, for each test dataset, you report:
A) the name of the test dataset,
B) the author of the test dataset (use Google if needed),
C) the trained model that gave the highest probability,
D) the author of the training dataset used to train the model from part C, and
E) whether the authors in parts B and D are a match (use abbreviations or acronyms to avoid making the table too big).
Besides the table, give your comments on what you observed in this experiment: how well does a model trained on one work of an author do at identifying other works of that author? There exists a matching training-dataset author for each test dataset. In cases where you didn't get a match, how close was the matching model's probability to that of the model with the highest probability? Why do you suppose this other model's probability was higher? Were there particular trained models that seemed to always give high or low probabilities? What might explain this?
Credit to Project Gutenberg, from which the training and test datasets were derived.