Chapter 8 Fisher-Wright

8.1 Outline

  • COVID
  • Fisher Wright vs Galton Branching Process
  • FW with mutation
  • Extinction
  • Application: Baby Names

Additional resources:

  • Blog “Introduction to the Wright-Fisher Model” https://stephens999.github.io/fiveMinuteStats/wright_fisher_model.html#pre-requisites

  • Hahn and Bentley (2003): Evolutionary anthropologists arguing that the neutral explanation of the Fisher-Wright model is consistent with the distribution of 1st names. What other quantitative or qualitative features of 1st name fashion could be used to try to reject the neutral model?

  • Felsenstein (2005): Very complete “lecture notes” for graduate genetics course. Lots of good commentary, does not assume a lot of math background, but lots of content and can be diffcult to read a piece by itself.

8.2 Parallel

Fisher-Wright

  • Children picking their parents (not “generative”)
  • Total population size is constant: process goes backwards
  • Qualitatively similar to BP. Extinction and fixation.
  • Flexible: mutation, selection, even changes in pop size.
  • With apologies, biologists take FW “seriously” even if they don’t take it “literally”.

Galton-Watson-Bienaymè Branching Processe

  • Branching process models independent parents randomly producing offspring. “Generative”
  • Total population size can vary, and has a random component and deterministic one \(m\)
  • Qualitative result when \(m = 1\) is that there is one longest surviving line. This is “fixation”, when one type becomes universal.

8.2.1 Another cell phone example

As in the Branching Processes chapter, let’s simulate the creation of generations using the numbers of cell-phone numbers (which are random). For this case, we need as many cell phone numbers as the generations to create, but each next cell phone number is a combination of digits from the first cell phone number.

Let these ficticious cell phone numbers be: 731 660 5362 and 530 666 7723. Note how the second number does not contain any 4s, 8s or 9s.

Generation 1 starts off with a sequence of numbers from 0 to 9. Then for generation 2, we assign each digit of the cell phone to one of cells. We repeat this for generation 3 but with the second cell phone number.

  knitr::kable(
  cbind(c(1:3), rbind(c(0:9),c(7, 3, 1, 6, 6, 0, 5, 3, 6, 2), 
        c(5, 3, 0, 6, 6, 6, 7, 7, 2, 3))),
  col.names = c('generation #', 'allele type', "", "", "", "", "", "", "", "", ""))%>%
  kableExtra::kable_styling(full_width = F) 
generation # allele type
1 0 1 2 3 4 5 6 7 8 9
2 7 3 1 6 6 0 5 3 6 2
3 5 3 0 6 6 6 7 7 2 3

If we translate this graph into a tally of the alleles by generation we see that some alleles are passed/‘’survive’’ more than others.

  knitr::kable(
  cbind(c(1:9), c(rep(1,9)), c(1, 1, 2, 0, 1, 3, 1, 0, 0), 
        c(1, 1, 2, 0, 1, 3, 2, 0, 0)),
  col.names = c('Allele type', "1", "2", "3"))   %>% add_header_above(c(" " = 1, "Count per generation" = 3))%>%
  kableExtra::kable_styling(full_width = F) 
Count per generation
Allele type 1 2 3
1 1 1 1
2 1 1 1
3 1 2 2
4 1 0 0
5 1 1 1
6 1 3 3
7 1 1 2
8 1 0 0
9 1 0 0
  • Here, allele type 6 will most likely dominate. This means that everybody will likely become similar and that individual types disappear.

NOTE FOR JOSH: does this explanation make sense? I had a hard time following the cell-phone number example in class so my notes are incomplete. Also, it’s not so straightforward to me how the “Children are choosing their parents here”. Could you explain this more?

8.3 Mutation

Let’s simulate the process from the example below. We first present some useful functions.

  fwm <- function(N, n_gen, mu = 0) ## mu != 4/N
  {
      ## simulate fisher-wright (with mutations)
      x <- paste(1:N) ## starting types
      A <- matrix(NA, nrow = n_gen, ncol = N)
      for (i in 1:n_gen)
      {
          A[i,] <- x
          x <- sample(x, size = N, replace = T) # Sample from previous generation and draw with replacement
          x <- mut(x, mu) 
          x
      }
      return(A) ## matrix of types, each line a generation.
  }

# Function to detect number of mutations. This occurs with probability mu. 
  mut <- function(x, mu)
  {
      ## m, the individuals that mutate
      m <- which(rbinom(length(x), size= 1, prob = mu) == 1)
      if (length(m) == 0) ## if no-one mutates
          return(x)
      ## add a suffix to their ID, so it will be unique (infinite alleles)
      suffix <- 10000*round(runif(length(m)),4)
      x[m] <- paste0(x[m], ".", suffix)
      x
  }

Here we try it out. - Each row is a generation and each column is a type of allele. - Notice that already by the third generation, there are several numbers that are not drawn anymore: 3,4,6,8,9.

  set.seed(1)
  fwm(N = 10, n_gen = 6, mu = 0)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" 
## [2,] "9"  "4"  "7"  "1"  "2"  "7"  "2"  "3"  "1"  "5"  
## [3,] "2"  "5"  "7"  "5"  "2"  "1"  "2"  "2"  "1"  "1"  
## [4,] "2"  "2"  "5"  "1"  "1"  "2"  "5"  "7"  "1"  "1"  
## [5,] "1"  "2"  "1"  "1"  "1"  "1"  "5"  "2"  "1"  "7"  
## [6,] "1"  "5"  "2"  "1"  "7"  "5"  "1"  "7"  "1"  "2"

The graph below shows a simulation of 20 generations each of size 10. The colors represent different types drawn in each generation.

  set.seed(1)
  A <- fwm(N = 10, n_gen = 20, mu = 0)
  tt <- table(A, row(A)) ## count types by row
  ptt <- prop.table(tt, 2) ## proportions
Generation composition

Figure 8.1: Generation composition

Let’s analyze it more.

  • By generation 5, many of the lines are flat (ocurrence = 0%) which means that they go extinct and the following generations can only select types from a reduced sample.
  • For instance line “5” (pink) is selected until generation 10. Around generation 7, it becomes the most prevalent type but then falls as other types are drawn relatively more. As we are looking at the ocurrence in samples were types disappear, the rises and falls in lines are indicate the substitution of numbers as some go extinct.
  • By generation 15, all draws are 2s and the other numbers do not ‘’survive’’
  • \(E(p_i(t) | p_i(t-1))\) is the likelihood that the same number (type) will be drawn given that it was drawn before. NOTE TO JOSH: is this really what the expectation means? Not on slides

This time, let’s increase the number of generations (200) and the number of types drawn at every step. In the long run, many of the colored lines disappear and the generations are composed of a handful of types.

  set.seed(1)
  A <- fwm(N = 100, n_gen = 200, mu = 0)
  tt <- table(A, row(A)) ## count types by row
  ptt <- prop.table(tt, 2) ## proportions
Generation composition long term

Figure 8.2: Generation composition long term

8.4 Fixation

Given that we saw before that some of the lines will disappear, we can look about keeping some of the lines fixed (that it will always be selected). For instance,

  • What is probability that line \(i\) will “fix”?
  • What is expected time until some line fixes?
  • How can we describe the path to fixation?

8.4.1 Probability that a particular line will “fix”

NOTE TO JOSH: the slide said that finding this was easy but I don’t necessarily see how the result from the graph goes with the formula below.

  set.seed(1)
  A <- fwm(N = 10, n_gen = 20, mu = 0)
  tt <- table(A, row(A)) ## count types by row
  ptt <- prop.table(tt, 2) ## proportions
  matplot(t(ptt), type = 'l', lty = 1, main = "FW simu")
  text(x = 4, y = jitter(ptt[,4]), rownames(ptt), col = 1:6)
Time to fix

Figure 8.3: Time to fix

8.4.2 Expected time until fixation?

Answer for us is \[ \bar{T}_{fixed} = 2 \cdot N \]

  • Note: Biologists say \(4 N_e\). See Wikipedia “Genetic drift”

  • Now we carry out simulations of complete processes. In total, we draw 100 times each generation in order to get the time to fixation. For this example, the mean number of generations taken to get only one remaining line is 202 generations \(\approx 2 \times\) the starting generation size.

  T.vec <- NULL 
  all.the.same <- function(x){length(unique(x)) == 1}
  set.seed(10)
  for (i in 1:100)  # 100 simulations
  {
      A <- fwm(N = 100, n_gen = 1000,mu = 0)   # generation process: rows are generations and columns are types
      extinction_time = min(row(A)[apply(A, 1, all.the.same)])  # obtain the first generation when all but one line went extinct
      T.vec[i] <- extinction_time
  }
  mean(T.vec)
## [1] 202.89

8.4.3 Path to fixation: a measure of homogeneity/heterogeneity

  • How does the type change over time?
  • We would like a measure of equality of the types of populations. A low measure would mean that the population is homogeneous while a high measure implies population heterogeneity.
  • Let the chance that two randomly drawn individuals are of same type be: \[ G = \sum_i p_i^2 \]
  • If we have two types, \(p_1 = \pi\) , \(p_2 = 1-\pi\). What is G if \(\pi = 0, .5, 1\)?
type_fun  <- function(pi_vector){
  oneminus_pi <- 1-pi_vector
  G <- pi_vector^2 + oneminus_pi^2
  
  return(plot(pi_vector, G, type="l", xlab = "Probability type", col="red"))
}

pi_vector <- seq(0,1,0.1)
type_fun(pi_vector)
Probability of being the same type

Figure 8.4: Probability of being the same type

  • Let’s derive time path of G.
    • Assume just two types, \(\pi(t)\)

The chance that two individuals are of same type \[G_{t+1} = P(\mbox{same parent})\cdot 1 + P(\mbox{different parent})\cdot G_{t}\]

  • Notation: I’m going use K for pop size. Bio uses 2N.

The probability of having the same parent is \(P(\mbox{same parent}) = \frac{1}{K}\), where \(K\) is the population size. This also implies that anybody in the population can be a parent with equal probability.

Additionally, \(P(\mbox{different parent}) = 1- P(\mbox{same parent}) = 1- \frac{1}{K}\)

Then, \[G_{t+1} = \left(\frac{1}{K}\right) \cdot 1 + \left(1 - \frac{1}{K}\right) \cdot G_{t}\]

For simplicity, let \(H = 1 - G\):

\[ \begin{aligned} 1 - H_{t+1} & = {1 \over K} + \left(1 - {1 \over K} \right) \cdot (1-H_{t})\\ & = {1 \over K} + 1 - H_{t} - {1\over K}+ {1\over K} H_{t} \\ H_{t+1} & = H_{t} - {1\over K} H_{t} \\ H_{t+1} & = H_{t} (1 - 1/K) \end{aligned} \]

For many periods, we start at the initial probabilities.

\[ H_{t} = H_0 (1 - 1/K)^t \rightarrow H_0 e^{-t/K} \] So, H goes to 0 exponentially, just as G goes to 1.

8.5 Baby Names

For this section we will look at an application of changes in baby names from this paper:

“Drift as a mechanism for cultural change: an example from baby names” by Matthew W. Hahn and R. Alexander Bentley Proc. R. Soc. Lond. B 2003 270, S120-S123

8.5.1 Basic idea

  • The process seems to be like Fisher-Wright because:
    • people choose from existing set
    • names are “neutral”
    • draw proportionally
  • Authors test to see if they can reject FW by comparin observed histograms to FW simulation
  • They include mutation to get a stationary distribution. Note: failing to reject FW doesn’t mean it’s correct

Baby Names by Matthew W. Hahn and R. Alexander Bentley Proc. R.Soc. Lond. B 2003 270, S120-S123 Source: Hahn and Bentley (2003)

8.5.2 Fisher-Wright simulation of Baby Names (Hahn and Bentley)

We download and prepare the data:

download.file(url= "https://www.ssa.gov/oact/babynames/names.zip",
              "./names.zip")
unzip("names.zip", exdir = "./names")

library(data.table)

filenames <- system("ls ./names/*.txt", intern = T)
mylist <- vector("list", length(filenames))
names(mylist) <- gsub(pattern = "[^0-9]", replace = "", filenames)
for (i in 1:length(filenames))
{
    myfile <- filenames[i]
    mylist[[i]] <- fread(myfile)
}

# Final dataframe
dt <- rbindlist(mylist, idcol = "year")
names(dt) <- c("year", "name", "sex", "N")

## Focus on male names from 1900-1909
my.dt <- dt[sex == "M" & year %in% 1900:1909]
foo <- my.dt[, .(N = sum(N)), by = name]
foo <- foo[order(N, decreasing = T)]
bar <- foo[1:1000,] ## 1000 top names

# Observed histogram of name occurence
my.breaks <- c(0, 2^(0:11)/10000)
bar[, p := round(prop.table(N),5)] # N column contains the share of times that a name is used
bar[, pcat := cut(p, breaks =  my.breaks, right = F, include.lowest = T)]

out <- unclass(prop.table(table(bar$pcat)))


my.x <- my.breaks[-length(my.breaks)] + diff(my.breaks)/2

Then we look at observed frequencies and they seem to behave as a power law (in log-log)

plot(my.x, out, log = "xy", xlab="Frequency of name", ylab= "log(share)")
Power

Figure 8.5: Power

8.5.3 Drawing their picture with simulation

Let’s look at evolution over time of G: chance that two individuals are of same type.

get.G <- function(x)
{
    tt <- table(x)
    p <- prop.table(tt)
    sum(p^2)
}

First we look at the case were there is no mutation

A <- fwm(1000, n_gen = 4000, mu = 0)
G.vec <- apply(A, 1, get.G)
plot(G.vec, ylab = "G", xlab= "# Generation", type ="l")
G path (without mutation)

Figure 8.6: G path (without mutation)

With mutation a single trial shows that the G value oscillates a lot around 0.1.

N = 1000
A <- fwm(N, n_gen = 3000, mu = 4/N)
G.vec <- apply(A, 1, get.G)
plot(G.vec, ylab = "G", xlab= "# Generation", type ="l")
G path (with mutation, 1 trial)

Figure 8.7: G path (with mutation, 1 trial)

We can increase the number of trials (up to 100 here) and look at the average (yellow horizontal line).

n_gen = 2000
n_trials = 100
G.mat <- matrix(NA,  n_trials, n_gen)
for (i in 1:n_trials)
    {
        N = 1000
        A <- fwm(N, n_gen, mu = 4/N)
        G.vec <- apply(A, 1, get.G)
        G.mat[i,] <- G.vec
        }
G.bar <- apply(G.mat, 2, mean)

matplot(t(G.mat), type = "l", ylab = "G", xlab= "# Generation")
lines(G.bar, lwd = 4)
abline(h = 1/9, lty = 3, col = "yellow", lwd = 5)
G path (with mutation, 100 trials)

Figure 8.8: G path (with mutation, 100 trials)

Why is the mean about .11?

1/(1 + 8)

Gillespie tells us that \(\bar{G}\) is supposed to be \(\frac{1}{(1 + 4*Ne*\mu)}\)

How does \(4*Ne*\mu = 8\)?

Well, we have \(K*\mu = 4\) and since \(K = 2*Ne\), \(Ne = K/2\) (maybe)

8.5.4 FW babyname simulation of equilibrium frequencies

NOTE TO JOSH: I don’t really know what is going on here and I noticed that the code mentions that it might not be (w)right.

N = 1000 ## Not sure if this is (w)right :)
mu = 4/N ## [1] 0.004
theta = N*mu ## [1] 4 ## H&B's "best fit"

8.6 Now we can simulate babynames

n_gen = 1001
N = 1000
## set.seed(1)
## A <- fwm2(N, n_gen, mu = 4/N)
#A <- fwm2(N, n_gen, mu = 8/N)
###############
#What about the fwm2 function?
#######################
A <- fwm(N, n_gen, mu = 8/N)
## ok, lets do power law plot of this
x <- A[1001,]
tt <- table(x)
ptt <- prop.table(tt)
my.breaks <- c(0, 2^(0:11)/10000)
p <- ptt
## bar[, p := round(prop.table(N),5)]
##bar[, pcat := cut(p, breaks =  my.breaks, right = F, include.lowest = T)]
pcat = cut(p, breaks =  my.breaks, right = F, include.lowest = T)
out <- unclass(prop.table(table(bar$pcat)))
out.hat <- unclass(prop.table(table(pcat)))
my.x <- my.breaks[-length(my.breaks)] + diff(my.breaks)/2
plot(my.x, out, log = "xy")
lines(my.x, out.hat)
Baby Names Simulation

Figure 8.9: Baby Names Simulation

8.7 Conclusions

  • Fisher-Wright an alternative to branching processes
  • It reverses logic of reproduction, but gives similar quantitative and qualitative results
  • A neutral model for other processes?
  • Starting point for coalescent

8.7.1 Some potential criticism

  • While we can’t reject that there’s some parameterization of FW that gives us similar disn, this doesn’t mean that we’ve found the right mechanism. (Just that we can’t reject it).
  • What are some other tests of this mechanism?
  • Markov assumption. We could see if each frequency really followed random walk.
  • Perhaps we could see if variances were scaled to frequencies correctly.

References

Felsenstein, Joseph. 2005. “Theoretical Evolutionary Genetics Joseph Felsenstein.” University of Washington, Seattle. https://evolution.gs.washington.edu/gs562/2017/pgbook.pdf.

Hahn, Matthew W, and R Alexander Bentley. 2003. “Drift as a Mechanism for Cultural Change: An Example from Baby Names.” Proceedings of the Royal Society of London. Series B: Biological Sciences 270 (suppl_1): S120–S123.