Instructor's Notes


Week 1

Course preliminaries: homework, exams, curving of grades, and the no-late policy.

Introduction

  • What is a stochastic process? Simple example of stock price. A queue example. What values can $X_n$ take?
  • What do you want to be able to describe about the stochastic process?
  • Conditional probabilities. Law of total probability.
  • The Markov property, which simplifies things significantly. The Markov property is best introduced using an example where people know how things work. Write down the general joint probability, then show the simplification we’ve used in the example.
  • This gives a nice segue into introducing the transition probability function $p(i,j)$. Tell them how to remember this.

Lecture 2

  • Remark that this can be done on any graph where we’ve specified the probability of each step. Remember that the example also shows the “summing to one across rows” property of stochastic matrices.
  • How good is this assumption? What are some places where it doesn’t work? Does it work in the stock market? What about weather? Remark: What’s a good non-Markov process? For example, suppose you had $N$ closed boxes and a prize in one of them. Call the location of the prize $Y$. Let $\{X_n\}_{n=1}^N$ be the sequence of boxes opened: each time, you pick uniformly at random from the unopened boxes until you hit the prize, and once you hit the prize, $X_n$ stays put. Given that the prize has not been hit, $X_{n+1}$ depends on the entire past sequence of boxes opened, so this ought to be non-Markov. But anyway, this is a pain in the butt to check.

    Here’s a simpler example. Suppose you toss a fair coin; this is your zeroth toss. If it shows up heads, you pick up a coin with probability $p$ of heads and continue to toss it. If it shows up tails, you pick up a coin with probability $q$ of heads and continue to toss it. Let $X_i \in \{ H, T \}$. The tosses after the zeroth are iid given the choice of coin, but the sequence $(X_0, X_1, \ldots)$ is not Markov, since the choice of coin is remembered through $X_0$.
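    A quick computation makes the failure of the Markov property concrete (assuming $p \neq q$):

    $$P(X_2 = H \mid X_1 = H, X_0 = H) = p, \qquad P(X_2 = H \mid X_1 = H, X_0 = T) = q,$$

    so the conditional law of $X_2$ given the past is not a function of $X_1$ alone.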

Stochastic matrices. $n$-step transition probabilities.

  • Collect the transition probabilities into a matrix.
  • Do more examples: a simple weather model, iid coin flips. I will ask them to do the iid coin flips model in their HW, I think. This will introduce them to the geometric distribution. Remark: Include this in their HW.
  • Initial distribution. Why is it useful to collect these things into a matrix? Use the weather example to introduce initial distributions and matrix multiplication. One-step transition probabilities. Remark: Row multiplication advances the probabilities in the Markov chain (see the sketch below).
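A minimal numpy sketch of this step (the transition matrix entries here are made up for illustration, not the ones from lecture):

```python
import numpy as np

# Two-state weather chain: state 0 = sunny, state 1 = rainy.
# Illustrative transition matrix; rows sum to one.
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

phi = np.array([1.0, 0.0])  # initial distribution: start sunny

# Row-vector multiplication advances the distribution by one step.
print(phi @ P)  # [0.8 0.2]
```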

Lecture 3

  • Prove that the $n$-step transition probabilities are given by the matrix powers.
  • State the “6th day is sunny” question in the weather problem and ask them to find the probability by taking $n$th powers (see the sketch below).
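Continuing the illustrative numpy sketch from above, the $n$-step probabilities come from a matrix power:

```python
import numpy as np

P = np.array([[0.8, 0.2],
              [0.4, 0.6]])
phi = np.array([1.0, 0.0])

# The n-step transition probabilities are given by the n-th matrix power.
P6 = np.linalg.matrix_power(P, 6)
print(phi @ P6)  # distribution of the weather on day 6
```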

Invariant distributions. Taking limits of $P^n$.

  • How would you understand the long-term behavior of the system? You would want to send $n$ to infinity. Suppose the limit exists: $\lim_{n \to \infty} \phi \P^{n} = \pi$, where $\pi$ is a vector of probabilities. Then $\pi \P = \lim_{n \to \infty} \phi \P^{n+1} = \pi$, so $\pi$ is unchanged by a step of the chain.

    That’s why $\pi$ is called the invariant probability distribution.

  • We want to ask: does $\pi$ exist? Is $\pi$ unique? What happens to the matrix $\P^n$? In fact, every row of $\P^n$ should go to $\pi$, in which case we can show that $v \P^n \to \pi$ for every probability vector $v$.
  • Hard to take matrix powers. For what kinds of matrices can you find powers easily?
  • Review of linear algebra. This is a good time to review matrix diagonalization, eigenvalues, eigenvectors, determinants, etc. with $2 \times 2$ matrices. This is probably easiest to do with the weather example. Remark I realized at this stage that it was important to do some simulation. I will spend a week or so on simulation and/or programming.
  • Relationship between the invariant distribution and eigenvectors. There is a good description of this and the Perron-Frobenius theorem in Lawler’s book. Essentially, you want to show diagonalization: if you write $P = V D V^{-1}$, then $P^n = V D^n V^{-1}$, and it is easy to take powers of diagonal matrices. Then you want to show that if $|\lambda| < 1$ for all the other eigenvalues, you can take the limit of $P^n$ as $n \to \infty$ (see the sketch below).
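A numpy sketch of this for the illustrative weather matrix above; the left eigenvector for eigenvalue $1$, normalized to sum to one, gives the invariant distribution:

```python
import numpy as np

P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

# Left eigenvectors of P are right eigenvectors of P transpose.
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))  # locate the eigenvalue 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                    # normalize to a probability vector
print(pi)                             # invariant distribution: [2/3, 1/3]

# Since the other eigenvalue has |lambda| < 1, every row of P^n converges to pi.
print(np.linalg.matrix_power(P, 50))
```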

Lecture 4

Perron-Frobenius theorem, Jordan form.

  • If $\P$ is a stochastic matrix, when can we show that it has $1$ as a simple eigenvalue, and when can we show that all other eigenvalues are strictly smaller than $1$ in absolute value?
  • It relies on the Perron-Frobenius theorem, which gives the above for $\P$ with strictly positive entries and rows summing to $1$. In fact, it’s enough if $\P^n$ has strictly positive entries for some $n$. This is seen by comparing the eigenvalues and eigenvectors of $\P$ and $\P^n$: if $\P v = \lambda v$, then $\P^n v = \lambda^n v$.

Remark I will give them problems 1.1, 1.2, 1.3 and 1.4 to solve in class. This is an in-class tutorial.

Lecture 5

  • Let’s introduce two important examples: the random walk with reflection and the one with absorption.

    Ex1 (reflecting): States $\{0,\ldots,4\}$; simple random walk in the interior, with $0 \to 1$ with probability $1$ and $4 \to 3$ with probability $1$.

    Ex2 (absorbing): States $\{0,\ldots,4\}$; simple random walk in the interior, with $0 \to 0$ with probability $1$ and $4 \to 4$ with probability $1$.

    Note the periodicity of the states. What’s the long-time behavior of $P^n$ in both these examples?

    Ex3: Suppose $S = \{0,1,2,3,4\}$ and simply write down a transition matrix that splits into two sets of states.

  • I will show them a video of Plinko and then do the random walk with reflecting barriers. Here’s a video with skinny Drew Carey.

  • Then I will do an introduction to Python. Tell them to get help online from PDFs, use Stack Overflow and Stack Exchange to ask questions, and then come to me with questions. Import numpy and scipy, two numerical libraries.

  • The random walk shows them clear periodicity. Show them how to do the random walk in Python too (see the sketch below).
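A minimal simulation sketch for the reflecting walk of Ex1 (boundary behavior as above):

```python
import numpy as np

rng = np.random.default_rng()

def reflecting_walk(n_steps, n_states=5, x0=0):
    """Simulate the random walk on {0, ..., n_states - 1} with reflecting barriers."""
    x, path = x0, [x0]
    for _ in range(n_steps):
        if x == 0:
            x = 1                      # reflect at the left barrier
        elif x == n_states - 1:
            x = n_states - 2           # reflect at the right barrier
        else:
            x += rng.choice([-1, 1])   # fair step in the interior
        path.append(x)
    return np.array(path)

path = reflecting_walk(1000)
# Fraction of time spent in each state.
print(np.bincount(path, minlength=5) / len(path))
```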

Classification of states

  • Communicating states: states that communicate with each other, i.e. there exist $m,n \geq 0$ such that $p_m(i,j) > 0$ and $p_n(j,i) > 0$. This is an equivalence relation: reflexivity ($i \leftrightarrow i$), symmetry, transitivity. The proof of transitivity is very important since it uses the fundamental inequality $p_{m+n}(i,k) \geq p_m(i,j)\, p_n(j,k)$, which comes from Chapman-Kolmogorov. Remark: Time-homogeneous Markov chains. I didn’t mention this earlier, and I will do it now.

Lecture 6

This sort of equivalence relation allows us to divide the state space into separate classes. You can do this with groups, vector spaces, whatever. So usually you would write this as $S / \sim$, where $S$ is the state space.

  • Irreducibility. If the chain consists of only one communicating class, then the chain is irreducible. Recall examples $1$ and $2$, and recall the two properties we showed for matrices such that $\P^n$ has strictly positive entries for some $n$:

    • There is a unique eigenvector corresponding to the eigenvalue $1$.
    • All other eigenvalues are strictly less than $1$ in absolute value.

    Remember how we showed these two properties. Example $1$ (reflecting RW) does not satisfy the hypothesis (it has period $2$, as we shall see), but has one communicating class. Example $2$ (absorbing RW) has three.

  • Classes can also be either recurrent or transient. Again, remember the absorbing random walk, which illustrates both.

  • Write down the general transition matrix of a Markov chain. It must have some recurrent classes and may have some transient states (see the block form below).
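In block form (my notation, with the states ordered so that the recurrent classes come first), the general picture is

$$P = \begin{pmatrix} P_1 & & & 0 \\ & \ddots & & \vdots \\ & & P_r & 0 \\ S_1 & \cdots & S_r & Q \end{pmatrix},$$

where $P_k$ is the transition matrix within the $k$th recurrent class, $Q$ collects the transitions among transient states, and $S_k$ gives the one-step transitions from transient states into the $k$th recurrent class.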

Remark Maybe as HW it is good to show that a Markov chain starting in a recurrent class never leaves it.

Periodicity

  • The period of state $i$ is the greatest common divisor $d_i$ of $J_i = \{ n \geq 1 : p_n(i,i) > 0 \}$. Show properties of $J_i$: it is closed under addition, and one can show that $J_i$ must contain all but a finite number of multiples of $d_i$. Note that $d_i$ is determined by $J_i$. Then, using irreducibility, show that $d_i$ is common to all states $i$ in each irreducible class.

Lecture 7

I made a mistake in the periodicity proof. To fix this, simply draw a picture.

Remark I ought to give them a bunch of gap-filling exercises in the new HW.

  • Irreducible, aperiodic chains. Big theorem. In the big theorem, note that it says that $\pi(i) > 0$. Where does this come from? It comes from the Perron-Frobenius theorem.

    Remark Is there any easy way to directly prove that $\pi(i) > 0$?

    In any case, the proof goes as follows. For each state $i$, there is an $r(i)$ such that $p_{n}(i,i) > 0$ for all $n > r(i)$. This is because $d = 1$; this is where aperiodicity is used. Then there is an $m(i,j)$ such that $p_{m(i,j)}(i,j) > 0$ because the chain is irreducible (there is only one communicating class). Pick $n = \max_{i}(r(i)) + \max_{i,j}(m(i,j))$. Then we have ensured that we can always return to $i$ and then make the jump from $i$ to $j$! This means that all the entries of $P^n$ are positive.

  • Reducible or periodic chains. Divide into recurrent classes and transient states. Assume for the time being that each recurrent class $R_k$ has a stationary distribution $\pi_k$. Then $p_n(i,j) \to \pi_k(j)$ if $i,j$ are in the same recurrent class. What happens if $i$ and $j$ are in different recurrent classes? What if $j$ is transient? What happens if $i$ is transient and $j$ is in some recurrent class? In this last case, let $\alpha(k)$ be the probability that the chain started at $i$ ends up in recurrent class $k$, and let $j \in R_k$, the $k$th recurrent class. Then

    $$p_n(i,j) \to \alpha(k)\, \pi_k(j).$$

Now we return to the periodic behavior. Show them the Python notebook with the reflecting states. In this case the stationary distribution does not represent the limit of $P^n$, since the powers oscillate with the period. In fact, the time averages still converge:

$$\frac{1}{n} \sum_{m=0}^{n-1} p_m(i,j) \to \pi(j).$$

Write down the generalization to period $d$.

Remark Note also that $\pi(i)$ represents the fraction of time spent in state $i$.

Return Times

Suppose $X$ is an irreducible chain.

  • Let $Y(n,j)$ be the amount of time spent in state $j$ up to time $n$, so that $Y(n,j) = \sum_{m=0}^{n} \mathbf{1}\{X_m = j\}$. Then compute $\frac{1}{n} \E[ Y(n-1,j) \mid X_0 = i ]$, because it’s related to the transition probabilities. In fact,

    $$\frac{1}{n} \E[ Y(n-1,j) \mid X_0 = i ] = \frac{1}{n} \sum_{m=0}^{n-1} p_m(i,j).$$

    But we’ve shown (including in the periodic case) that these averages must go to $\pi(j)$!

    Remark I ought to give this as an exercise.

  • Now we will relate the stationary distribution to the return time to a state. This is a beautiful argument. Define $T$ to be the return time to state $i$, and let $T_m$ be the time between the $(m-1)$th and $m$th returns. Then these inter-return times are iid random variables, and clearly $k^{-1} \sum_{m=1}^{k} T_m \to \E[T]$. In other words, we have $k$ returns to state $i$ in approximately $k \E[ T ]$ steps. This means that the fraction of time we’ve spent in state $i$ is approximately

    $$\frac{k}{k \E[T]} = \frac{1}{\E[T]} = \pi(i).$$

  • Do the two-state example. The distribution of the two-state example can be written down explicitly (see the worked example below).

    Remark: Again, this is HW, because it gives a good review of taking expectations.
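A worked version, with $a, b \in (0,1)$ as my notation for the two switching probabilities:

$$P = \begin{pmatrix} 1-a & a \\ b & 1-b \end{pmatrix}, \qquad \pi = \left( \frac{b}{a+b}, \frac{a}{a+b} \right),$$

and the expected return time to the first state is $\E[T] = \frac{a+b}{b} = \frac{1}{\pi(1)}$, consistent with the return-time argument above.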

Next time: transient states, gambler’s ruin, etc.

Lecture 9

Transient States

Announce that transient states are not included on the quiz. Remind them about the quiz this Friday.

  • Take the absorbing random walk with transient states and draw the matrix.
  • Draw the matrix in block form, with a block $Q$ of transitions among the transient states and a block $S$ from transient to recurrent states, as in the canonical form above.

  • Then show that, for transient states $i,j$, $\left( \sum_n P^n \right)_{ij} = \left( \sum_n Q^n \right)_{ij}$.
  • Then finally discuss the matrix $M = (I - Q)^{-1} = \sum_n Q^n$. Show how to sum over a row to find the total expected number of visits to transient states before entering a recurrent class.
  • Can also use this technique to find the expected number of steps to visit state $i$ from state $j$: simply make state $i$ absorbing. Show the example of the reflecting random walk.
  • We wanted to ask: suppose there are at least two different recurrent classes. Then, if we wanted to find $\alpha(t, r_1)$, the probability that the chain started at transient state $t$ is absorbed in recurrent class $r_1$, we would get a cool formula for $A$, the matrix of the $\alpha$’s, by conditioning on the first step. This gives me

    $$A = S + Q A, \qquad \text{so} \qquad A = (I - Q)^{-1} S = M S.$$

    This is quite an interesting formula, that I can’t seem to get directly! (See the numpy check below.)
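A quick numpy check on the absorbing walk of Ex2 (states $\{0,\ldots,4\}$ with $0$ and $4$ absorbing):

```python
import numpy as np

# Transitions among the transient states 1, 2, 3 (fair steps left/right).
Q = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0]])

# One-step transitions from the transient states into the absorbing states 0 and 4.
S = np.array([[0.5, 0.0],
              [0.0, 0.0],
              [0.0, 0.5]])

M = np.linalg.inv(np.eye(3) - Q)  # fundamental matrix: expected visit counts
A = M @ S                         # absorption probabilities

print(M.sum(axis=1))  # expected steps before absorption: [3. 4. 3.]
print(A[0])           # from state 1: absorbed at 0 w.p. 0.75, at 4 w.p. 0.25
```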

Then do the absorbing random walk again, starting from state $1$. It should give you a probability of $3/4$ of being absorbed at $0$.

As a final example, I ought to do gambler’s ruin. For gambler’s ruin, I need to show them how to solve difference equations. This I will do on Wednesday.
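A sketch of the difference-equation step for the fair game (standard argument; $N$ is my notation for the upper boundary): let $h(x)$ be the probability of hitting $0$ before $N$ starting from $x$. Conditioning on the first step gives

$$h(x) = \tfrac{1}{2} h(x+1) + \tfrac{1}{2} h(x-1), \qquad h(0) = 1, \quad h(N) = 0,$$

so $h$ is linear and $h(x) = 1 - x/N$. With $N = 4$ and $x = 1$ this recovers the $3/4$ above.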

More examples

  • Gambler’s ruin, SRW on a circle, urn model, cell genetics, and card shuffling. Remark: I didn’t do the expected time to absorption. This should be on HW too.

  • Random walk on a circle, some nice questions about the cover time of the circle. Has a nice mix of conditioning and stuff.

  • Urn model. It would be good to show that the stationary distribution is the binomial distribution. Remark: should be on HW.

  • Remark on the mixing time of Card Shuffling and tell them the story of Persi Diaconis.

  • Maybe show them the Google algorithm.

Week 6 (Week 4 in September)

I spent the previous week discussing exams and starting simulation from Chapter 11 in Ross. The exam had two problems, both on classifying states and drawing diagrams for Markov chains.

In the programming section, I’ve covered

  • How to pick a uniform permutation.
  • How to use the cdf to pick continuous distributions.
  • What if it’s a pain to compute the cdf and take its inverse?
  • Von Neumann’s method (rejection sampling) to sample from densities. Notice the absolute-continuity requirement in the algorithm: if $g(x) = 0$, you require $f(x) = 0$, since otherwise the ratio $f/g$ cannot be bounded. (See the sketch after this list.)
  • Then I did sampling of uniform random variables in areas on the plane. This would be used in generating a normal random variable in 2D.

    Remark: Ask them in HW to show why this gives a uniform random variable.

  • Generating 2D normal random variables with Box-Muller and the polar method (see the sketch after this list).
  • Then I will do estimating integrals and expectations using our Monte Carlo method. Remark: I didn’t end up doing this. Maybe some other time.
  • Finally, I asked them to simulate the reflecting random walk and estimate the stationary probability by:

    a. directly simulating the Markov chain and looking at the proportion of time spent in each state;
    b. sampling from the stationary distribution using ‘coupling from the past’.
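A minimal sketch of the Von Neumann (rejection) step, assuming a target density $f$ on $[0,1]$ bounded by a constant $c$ (so the proposal $g$ is uniform and $f/g \le c$); the target $f(x) = 2x$ is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng()

def rejection_sample(f, c, n_samples):
    """Sample from a density f on [0, 1], assuming f(x) <= c everywhere."""
    samples = []
    while len(samples) < n_samples:
        x = rng.uniform()              # propose from g = Uniform[0, 1]
        u = rng.uniform()
        if u <= f(x) / c:              # accept with probability f(x) / (c g(x))
            samples.append(x)
    return np.array(samples)

xs = rejection_sample(lambda x: 2.0 * x, 2.0, 10_000)
print(xs.mean())  # should be close to E[X] = 2/3
```

And a short Box-Muller sketch (the standard transform; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng()

# Box-Muller: two independent uniforms -> two independent standard normals.
u1, u2 = rng.uniform(size=10_000), rng.uniform(size=10_000)
r, theta = np.sqrt(-2.0 * np.log(u1)), 2.0 * np.pi * u2
z1, z2 = r * np.cos(theta), r * np.sin(theta)
print(z1.mean(), z1.std())  # approximately 0 and 1
```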

Final Week of September

I held a final week of programming, where they learned a couple of different ways to simulate Markov chains. Essentially, they looked at coupling from the past, where you can sample directly from the stationary distribution.

Now, I’m moving onto countable Markov chains.

Week 10 (October)

We finished countable Markov chains rather quickly, and I’m finishing up branching processes right now.

  • How was the quiz? How many think it was hard, how many think it was fair?
  • Thank them for the feedback. I should mention that more help with Python will be forthcoming.
  • Online resources: codeacademy.com, coursera.com, Learn Python the Hard Way. The first lesson is called “The Hard Way Is Easier.” I understand that some of you do not see the point of learning how to code in a math class. But if you’re doing any kind of research, whether it is applied real-world research or proving theorems, being able to code is invaluable.
  • We will be doing theory for another chapter or so, where we will do continuous time Markov chains. After this, we will return to coding for a bit. Finally we will do a little queuing theory.

While reading about the branching process, I also learned about the things Francis Galton did. Quite remarkable! Eugenics, “regression towards the mean”, the Galton board (the Plinko we saw earlier in the course), and he was Darwin’s cousin.