Markov models are probabilistic sequence models that can be used for generating sequences according to patterns learned from other (training) sequences. A trained model can also be used to compute the probability of another sequence. In this project you'll get a chance to try both: generating and classifying sequences of words using Markov models trained on fictional literature sources.
In the project, you'll work with two datasets (i.e., works) from each of the 4 pictured authors (plus one dataset from Dr. Seuss, just for good measure). Can a model trained on one dataset from an author be used to identify another dataset from the same author? How could such a model be useful for computational creativity?
The code was developed with Python 3.9.
The primary goal of this project is for you to learn how to use Markov models as a conceptualization model. We provide a Python framework and some datasets from which to start. The framework is all set to generate results (make sure that the data directory is in the same location as the main.py script). To get meaningful results, you are left to implement the Markov model class in the main.py file. This class requires (at a minimum) that you implement three class functions:
generate_sequence()
- a function that probabilistically generates a sequence of words based on the trained model parameters and returns it as a string.
compute_log_probability(test_sentences)
- a function that uses the trained model parameters to compute the probability of the input text (a list of sentences, where a sentence is a list of words).
Empty class data structures are created for you in the constructor, which should give you some idea of how to store and reference the parameters as you implement these functions. As a way to save space, one common strategy when training probabilistic models is to first use a data structure to tally up counts of a particular occurrence (in this case a particular tuple, or a transition between two particular tuples) and then, once all tallies are complete, normalize these counts to form a probability distribution (remember that a distribution sums to 1.0). A tuple (both in Python and in CS generally) is an ordered set of elements. We refer to tuples instead of words because a higher-order Markov model (i.e., a model where the Markov order is >1) can be created by essentially representing each token as a tuple of words instead of a single word. In this representation, a transition between two tokens means that the two tuples have overlapping words. For example, if I train on the phrase "I like computer science" and I'm using a Markov order of 2, one way to do this is to consider the phrase as a sequence of tuples, each of length 2: (I, like), (like, computer), (computer, science). Note that when generating a sequence, you will only add the last word in each sampled tuple, except in the case of the starting tuple.
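As a rough sketch of this tally-then-normalize strategy (not the required implementation; the helper names train_counts and normalize are made up for illustration), an order-2 model could be built along these lines:

from collections import defaultdict

def train_counts(sentences, order=2):
    """Tally transitions between overlapping word tuples (illustrative helper)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # Break each sentence into overlapping tuples of length `order`, e.g.
        # ["I", "like", "computer", "science"] -> (I, like), (like, computer), (computer, science)
        grams = [tuple(sentence[i:i + order]) for i in range(len(sentence) - order + 1)]
        for prev, curr in zip(grams, grams[1:]):
            counts[prev][curr] += 1
    return counts

def normalize(counts):
    """Convert raw tallies into probability distributions that each sum to 1.0."""
    distributions = {}
    for prev, following in counts.items():
        total = sum(following.values())
        distributions[prev] = {curr: c / total for curr, c in following.items()}
    return distributions

With the example phrase above, train_counts([["I", "like", "computer", "science"]]) tallies one transition from (I, like) to (like, computer) and one from (like, computer) to (computer, science).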
You are encouraged to add other class functions as needed in keeping with principles of abstraction and encapsulation.
As you work with dictionaries, note that you cannot use a mutable object (e.g., a list) as a key for a dictionary. Note that whereas lists are mutable, tuples in Python are not.
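For instance, a tuple can be used directly as a dictionary key, while a list cannot:

counts = {}
counts[("I", "like")] = 3      # fine: tuples are hashable
# counts[["I", "like"]] = 3    # would raise TypeError: unhashable type: 'list'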
Implemented for you are several helper functions:
sample(distribution)
- a function to pick a token t from a distribution (i.e., a dictionary representing a distribution) with probability p(t), where p(t) is the value associated with the key t in distribution.
parse_file(filename)
- a function that parses a text dataset and converts it into a (cleaned) list of sentences (where each sentence is a list of words).
compute_vocab_size()
- a function that counts all of the unique words (not tuples) in the training and test datasets. Because you'll only be running experiments that compute probabilities for OOV tokens using a Markov order of 1, this count will suffice as a representation of the size of your vocabulary.
Of these, you only ever need to call the sample(distribution) function, as the other two are called in the main() function, which is implemented for you.
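For example, a call to sample might look like the following (the distribution here is invented purely for illustration):

# A dictionary representing a distribution over possible next tokens.
distribution = {"computer": 0.75, "dogs": 0.25}
next_token = sample(distribution)  # returns "computer" with probability 0.75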
Your model should also be trained on end-of-sequence tokens. One easy way to do this is to represent an end-of-sequence token as None. When generating sequences, your model should sample a next token until it samples the end-of-sequence token, at which point it returns the sampled sequence.
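A minimal sketch of that generation loop for a Markov order of 1, assuming the trained parameters are stored in hypothetical attributes self.start_distribution and self.transitions (your names and structures may differ):

def generate_sequence(self):
    """Sample words until the end-of-sequence token (None) is drawn."""
    words = []
    token = sample(self.start_distribution)      # hypothetical: distribution over starting tokens
    while token is not None:
        words.append(token)                      # at order 1, each token is a single word
        token = sample(self.transitions[token])  # hypothetical: distribution over next tokens
    return " ".join(words)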
It is common when computing probabilities for new sequences (in particular, new sequences of words) to encounter tuples that were not seen during training. Since the model never saw such a tuple in training, it doesn't have a probability for it. How do we deal with that? One solution is to say that the probability of that particular tuple is zero according to the model. This sets the probability of any sequence with unseen tuples to zero (since probabilities are multiplied). A better solution is to give some small probability to unseen tuples. How much probability should we give? This is an area of ongoing research, but one simple solution is to give out-of-vocabulary (OOV) tokens or token transitions a probability of 1/vocab_size, where vocab_size is the number of words in your vocabulary. What is "your vocabulary"? This is also a non-trivial design decision, but for the purposes of this lab, our vocabulary will be the set of all words seen in any of the training or test datasets. This value is computed for you as the first step in the main function.
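One way this fallback might look in code (the method name transition_probability and the attribute self.transitions are assumptions, not part of the provided framework):

def transition_probability(self, prev, curr, vocab_size):
    """Return p(curr | prev), falling back to 1/vocab_size for unseen tokens or transitions."""
    if prev in self.transitions and curr in self.transitions[prev]:
        return self.transitions[prev][curr]
    return 1.0 / vocab_size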
One final note on probabilities: when you multiply probabilities over and over, the numbers become so small that eventually you get underflow, where there aren't enough bits to maintain precision. In short, if you multiply probabilities when computing probabilities of sequences, you'll almost always get 0.0 as the final probability. How to deal with this? Instead of multiplying probabilities, we operate in the space of log probabilities. It's actually quite simple. To get a log probability, you take the log of the probability. Then, wherever you would normally have multiplied probabilities, you add the log probabilities. Log probabilities are always negative numbers, but it's still the case that larger (i.e., less negative) numbers mean higher probability. We will only ever compare probabilities between different models, so feel free to compare log probabilities and not worry about bringing them back out of log space.
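In code, the switch to log space amounts to summing math.log values instead of multiplying raw probabilities. A simplified sketch (ignoring start and end-of-sequence handling, and reusing the hypothetical transition_probability helper and a hypothetical self.vocab_size attribute):

import math

def compute_log_probability(self, test_sentences):
    """Sum log probabilities over every transition in every sentence."""
    log_prob = 0.0
    for sentence in test_sentences:
        for prev, curr in zip(sentence, sentence[1:]):
            log_prob += math.log(self.transition_probability(prev, curr, self.vocab_size))
    return log_prob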
Once you are satisfied that your implementation is working correctly, you should write a report according to the following requirements. Compress your report and your code into a single ZIP file and upload it to Moodle. The report should be written in LaTeX. Here is an Overleaf template (go to "Menu" > "Copy Project") to help in writing the report. The report should follow the standard format for a scientific paper, with sections as indicated below (you don't need to write an Intro or Related Works section for this project, but I include them as a reminder that they are part of the standard format). For each section, include a subsection (per requirement) that briefly but thoroughly responds to each of the requirements below. Unless specified otherwise, assume a Markov order of 1 and the default training and test datasets.
Run the experiment in which each trained model computes the probability of each test dataset (which is what the main function is set up to do). Include in your report a table where, for each test dataset, you report:
A) the name of the test dataset,
B) the author of the test dataset (use Google if needed),
C) the trained model that gave the highest probability,
D) the author of the training dataset used to train the model from part C, and
E) whether the authors in parts B and D are a match (use abbreviations or acronyms to avoid making the table too big).
Besides the table, give your comments on what you observed in this experiment: how well does a model trained on one work of an author do at identifying other works of that author? There exists a matching training-dataset author for each test dataset. In cases where you didn't get a match, how close was the matching model's probability to that of the model with the highest probability? Why do you suppose this other model's probability was higher? Were there particular trained models that seemed to always give high or low probabilities? What might explain this?
Credit to Project Gutenberg, from which the training and test datasets were derived.