MMOD does sequences - David Winter

A couple of people have contacted me in the last little while to ask whether mmod, my little R library for calculating population differentiation statistics, can be used to deal with sequence data.

When I was designing the pacakge I was really thinking about handling microsattelite genotypes, but the statistics that mmod calculates are actually all capable of dealling with almost any genetic data. Although the actualy calculations can be handled easily enough, it’s a bit of a pain converting standard sequence datasets into population genetic ones. So, I’ve written a little function to aid the conversion of the standard representation of DNA sequences in R (DNAbin objects) into the genind object that is used to store population genetic data.

I’ll use this post to show you how it works, and to compare the “identity-based” statisitics mmod caluclates with the standard AMOVA approach to population differentiation from sequence data.

AMOVA

The most widely used method of measuring population differentiation from sequence data is Excoffier’s AMOVA (Analysis of Molecular Variance). This method takes into account the relationship betwen haplotypes in terms of genetic distance between them and paritions variation in those distances into within- and among-population components. It also produces an Fst-like statistic (called Phi-statisitics) for every level of population strucure being considered.

Emmanuel Paradis’ library pegas can calulate AMOVA for a given dataset. Here’s an example use some faked-up data:

library(pegas)
data(woodmouse)
set.seed(123)
(fake_data <- woodmouse[sample(1:15, 100, replace = TRUE), ])

## 100 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 965 
## 
## Labels: No0908S No1114S No0910S No1206S No1208S No305 ...
## 
## Base composition:
##     a     c     g     t 
## 0.306 0.261 0.126 0.307 

To measure population differentiation we obviously need some populations, the amova function takes a vector assiging each individual in the dataset to a particular location (note, you aren’t limited to two levels of population structure, you can use R’s formula syntax to denote, say, ~samples/sites/regions):

pops <- factor(rep(c("Dunedin", "Mosgeil"), each = 50))
dm <- dist.dna(fake_data, "N")
(res <- amova(dm ~ pops))

## 
##   Analysis of Molecular Variance
## 
## Call: amova(formula = dm ~ pops)
## 
##           SSD   MSD df
## pops    41.72 41.72  1
## Error 7449.26 76.01 98
## Total 7490.98 75.67 99
## 
## Variance components:
##       sigma2 P.value
## pops  -0.686    0.71
## Error 76.013        
## 
## Variance coefficients:
##  a 
## 50 
## 

For what ever reason amova doesn’t give the Phi-statisitics for each analysis, but we can do that for ourselves:

sigma2 <- res$varcomp[1, 1]
(phi_st <- (sigma2/(sigma2 - res$tab[1, 2])))

##[1] 0.0161

Everthing else (mmod)

AMOVA is very useful, but, like G_st it can be difficult to compare between studies because the Phi-statistics are partially dependant on the total diversity in the dataset. There is not obvious way to correct for this effect, while mainting the information on how haplotypes relate to each other.

An alternative approach is to discard that ifnormation, and simply use the frequency of each haplotype present to calculate differentiation statistics in the same way you would for alleles from SNPs or microsattelites. The latest build of mmod has a function, as.genind.DNAbin to make this easy. It’s not on CRAN just yet, but you can install it from github. The function takes a DNAbin object and a vector of populations for each sequence:

library(devtools)
install_github(rep='mmod', username='dwinter')
library(mmod)
(fake_genind <- as.genind.DNAbin(x=fake_data, pop=pops))

## 
##    #####################
##    ### Genind object ### 
##    #####################
## - genotypes of individuals - 
## 
## S4 class:  genind
## @call: genind(tab = tab, pop = pops)
## 
## @tab:  100 x 15 matrix of genotypes
## 
## @ind.names: vector of  100 individual names
## @loc.names: vector of  1 locus names
## @loc.nall: number of alleles per locus
## @loc.fac: locus factor for the  15 columns of @tab
## @all.names: list of  1 components yielding allele names for each locus
## @ploidy:  2
## @type:  codom
## 
## Optionnal contents: 
## @pop:  factor giving the population of each individual
## @pop.names:  factor giving the population of each individual
## 
## @other: - empty -
## 

Once you’ve converted a sequence file into a genind object it behaves just as you’d expect:

diff_stats(fake_genind)

## $per.locus
##        Hs     Ht      Gst Gprime_st      D
## L1 0.9257 0.9332 0.008113    0.2165 0.2037
## 
## $global
##        Hs        Ht   Gst_est Gprime_st     D_het    D_mean 
##  0.925657  0.933228  0.008113  0.216513  0.203696  0.203696 
## 

It is interesting to note the very different measures of differentiation you get from Gst_est (Nei’s G_st) and other statisitics. It some ways this is the worst possible example for G_st, since the statistic is known to be biased upward when (a) there are a small number of sub-populations being considered and (b) the over-all deversity is high.