Markov Models

Markov models are used when dealing with sequences. For example a sequence of letters, a sequence of words, or a sequence of events. It relies on probabilities to classify sequences as having relatively high or low probability. It can also be used to generate new sequences with a desired probability. They are trained on a particular language and training dataset.

Example: Given the patterns of letters in English, how probable is your name?

This activity is best done as a group split into partners, but can be done with even one or two people.

Find a partner. Ask them the first 3 letters of their name (e.g., "pau").
Open tinyurl.com/markovisu in a separate tab or window.
Find the probability that a word in English starts with the first letter in your name. In the top row of the matrix, click on the column associated with the first letter in the name (e.g., "p"). Write down the "% probability", but as a decimal (i.e., divide by 100) (e.g., ".0337").
For each of the remaining two letters, find the probability of transitioning from the previous letter (e.g., "p") in your name so far to the next letter ("e.g., "a") in your name:

Find the row that corresponds to the previous letter (e.g., first time you will find the row for "p"; the second time find the row for "a").
Find the column in that row that corresponds to the next letter (e.g., first time find the column for "a"; the second time find the column for "u").
Click on that cell to find the "% probability", but as a decimal (i.e., divide by 100) (e.g., first time is ".1207"; second time is ".0123").

Multiply the three decimals that you found. In summary, using our example we are computing: 0.0337 × 0.1207 × 0.0123 = 5.003×10^−5 (i.e., P("pau") = P("p") × P("a"|"p") × P("u"|"a"))

Follow-up Questions

Compare different name beginnings and their probabilities. Which name beginning had the highest probability? Which had the lowest? Given all of the words and spellings you know in the English language, do your results make sense?
Does the probability you got mean anything by itself or only as compared to other probabilities?
What happens to the probability each time you multiply by a "%probability" decimal? Based on your reasoning, is it fair to compare probabilities for names of different lengths?
Look at the matrix. Where do we see the highest probabilities? Where do you see the lowest? Does this agree with your intuition?
Try to think how you might generate a matrix like the one used in this game. Could you create one for another language? Could you create one for words instead of letters? How about for music notes, weather patterns, stock market trends, or DNA sequences?

Example 2: Generate a new word that fits the English language

Open tinyurl.com/markovisu in a separate tab or window.
Generate the first letter in your new word. Pick a random number between 0 and 100 (this is best done with a random number generator). As a working example, say we sampled 34.35. Starting from the left end of the top row of the matrix, add up the "% probability" for each cell in the row until your sum is greater than your random number. Which ever column you last added corresponds to the first letter of your new word. For our example, that letter is "i".
Use the following steps to generate each subsequent letter until you generate the "_" character (i.e., the first column in the matrix):

Find the row that corresponds to the previous letter (e.g., first time you will find the row for "i").
Pick a random number between 0 and 100. Starting from the left end of the row, add up the "% probability" for each cell in the row until your sum is greater than your random number. Which ever column you last added corresponds to the next letter of your new word.

The probability of the word you sampled is found using the same steps from the first example above where we calculated the probability of your name.

Follow-up Questions

In what ways do the words you generated look like English words?
In what ways do the words you generated not look like English words?
You should note that pairs of adjacent letters seem right, but that some triples seem awkward. Why might that be the case? Can you think of a way to modify the approach to make triples flow better? Quadruples?
Why didn't the process you followed generate an actual word from the English language? Could it have? If so, why can I with confidence guess that your word isn't an actual word?