MMOD Does Sequences

A couple of people have contacted me in the last little while to ask whether mmod, my little R library for calculating population differentiation statistics, can be used to deal with sequence data.

When I was designing the pacakge I was really thinking about handling microsattelite genotypes, but the statistics that mmod calculates are actually all capable of dealling with almost any genetic data. Although the actualy calculations can be handled easily enough, it’s a bit of a pain converting standard sequence datasets into population genetic ones. So, I’ve written a little function to aid the conversion of the standard representation of DNA sequences in R (DNAbin objects) into the genind object that is used to store population genetic data.

I’ll use this post to show you how it works, and to compare the “identity-based” statisitics mmod caluclates with the standard AMOVA approach to population differentiation from sequence data.

AMOVA

The most widely used method of measuring population differentiation from sequence data is Excoffier’s AMOVA (Analysis of Molecular Variance). This method takes into account the relationship betwen haplotypes in terms of genetic distance between them and paritions variation in those distances into within- and among-population components. It also produces an Fst-like statistic (called Phi-statisitics) for every level of population strucure being considered.

Emmanuel Paradis’ library pegas can calulate AMOVA for a given dataset. Here’s an example use some faked-up data:

1
2
3
4
library(pegas)
data(woodmouse)
set.seed(123)
(fake_data <- woodmouse[sample(1:15, 100, replace = TRUE), ])
1
2
3
4
5
6
7
8
9
## 100 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 965 
## 
## Labels: No0908S No1114S No0910S No1206S No1208S No305 ...
## 
## Base composition:
##     a     c     g     t 
## 0.306 0.261 0.126 0.307 

To measure population differentiation we obviously need some populations, the amova function takes a vector assiging each individual in the dataset to a particular location (note, you aren’t limited to two levels of population structure, you can use R’s formula syntax to denote, say, ~samples/sites/regions):

1
2
3
pops <- factor(rep(c("Dunedin", "Mosgeil"), each = 50))
dm <- dist.dna(fake_data, "N")
(res <- amova(dm ~ pops))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
## 
##   Analysis of Molecular Variance
## 
## Call: amova(formula = dm ~ pops)
## 
##           SSD   MSD df
## pops    41.72 41.72  1
## Error 7449.26 76.01 98
## Total 7490.98 75.67 99
## 
## Variance components:
##       sigma2 P.value
## pops  -0.686    0.71
## Error 76.013        
## 
## Variance coefficients:
##  a 
## 50 
## 

For what ever reason amova doesn’t give the Phi-statisitics for each analysis, but we can do that for ourselves:

1
2
sigma2 <- res$varcomp[1, 1]
(phi_st <- (sigma2/(sigma2 - res$tab[1, 2])))
1
##[1] 0.0161

Everthing else (mmod)

AMOVA is very useful, but, like G_st it can be difficult to compare between studies because the Phi-statistics are partially dependant on the total diversity in the dataset. There is not obvious way to correct for this effect, while mainting the information on how haplotypes relate to each other.

An alternative approach is to discard that ifnormation, and simply use the frequency of each haplotype present to calculate differentiation statistics in the same way you would for alleles from SNPs or microsattelites. The latest build of mmod has a function, as.genind.DNAbin to make this easy. It’s not on CRAN just yet, but you can install it from github. The function takes a DNAbin object and a vector of populations for each sequence:

1
2
3
4
library(devtools)
install_github(rep='mmod', username='dwinter')
library(mmod)
(fake_genind <- as.genind.DNAbin(x=fake_data, pop=pops))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
## 
##    #####################
##    ### Genind object ### 
##    #####################
## - genotypes of individuals - 
## 
## S4 class:  genind
## @call: genind(tab = tab, pop = pops)
## 
## @tab:  100 x 15 matrix of genotypes
## 
## @ind.names: vector of  100 individual names
## @loc.names: vector of  1 locus names
## @loc.nall: number of alleles per locus
## @loc.fac: locus factor for the  15 columns of @tab
## @all.names: list of  1 components yielding allele names for each locus
## @ploidy:  2
## @type:  codom
## 
## Optionnal contents: 
## @pop:  factor giving the population of each individual
## @pop.names:  factor giving the population of each individual
## 
## @other: - empty -
## 

Once you’ve converted a sequence file into a genind object it behaves just as you’d expect:

1
diff_stats(fake_genind)
1
2
3
4
5
6
7
8
## $per.locus
##        Hs     Ht      Gst Gprime_st      D
## L1 0.9257 0.9332 0.008113    0.2165 0.2037
## 
## $global
##        Hs        Ht   Gst_est Gprime_st     D_het    D_mean 
##  0.925657  0.933228  0.008113  0.216513  0.203696  0.203696 
## 

It is interesting to note the very different measures of differentiation you get from Gst_est (Nei’s G_st) and other statisitics. It some ways this is the worst possible example for G_st, since the statistic is known to be biased upward when (a) there are a small number of sub-populations being considered and (b) the over-all deversity is high.

Rentrez - an R Package for Interfacing With the NCBI’s Databases

Biologists make a lot of data these days. In fact, we seem to make more data than we can really deal with. Databases like Pubmed, Fishbase, GBIF and the rest of the NCBI’s offerings are full of answers to questions that no one has got around to asking.

ROpenSci is a collection of libraries for the R progamming language that is designed to help biologists pull data from all over the web into their R sessions and speed the process of turning data into information. The ROpenSci team have already made a great set of tools, which allow researchers to work with a bunch of databases, and, just as importantly share their results as nice reproducable R code. Up until recently R has not really had a library for taking data from the NCBI’s various databses (including such massive sources of data as Genbank and Pubmed). I’m very happy to say that my library for doing just this, rentrez is now part of the ROpenSci family.

What rentrez does

At present the functions provided by rentrez cover the entire Eutils API. Basically, the functions take arguments provided by a user, produce the URL needed to query the NCBI’s API and fetches the resulting data. In most cases the functions return lists that contain the parts of the resulting file that are most likely to be useful as items. When the returned file is XML the list contains the XML file for those that want to dig deeper.

What you can do

Grab a fork! The idea of ROpenSci is to create a set of tools that are useful to as many scientists as possble. That means open data and open code. If you find a bug in what I’ve done I want to know about it, if you think it can be usefully extended feel free to pick it up and run. I’m also keen to here of use-cases other than the ones already outlined in the documentation.

An example usage

There are some short examples in the README file for the rentrez repository, but I thought I’d run through a longer one here. This is also reproducable blogging, the markdown underlying this post was processed with knitr from this source

Lately, I’ve been working on a little meta-analysis of phylogenies. In particualr, we’re interested in why sometimes different genes tell different stories about the relationships between species from which the come. In terms of being able to get the individual gene trees I need to do these analyses there are good, rather less good and quite bad papers out there. In the best cases I can just download the trees as nice, parsable newick files fromTreeBase (which has already been wrapped by ROpenSci). Sometimes I need to print out the trees from a paper and work with pencil and paper, which I can handle. In a few cases people haven’t actually published their individual gene trees, if I want to included these papers I need to replicate their work by downloading the gene sequences, aligning them and making new trees.

So, here’s an example of how I’ve been using rentrez to automate some of that process. I’m going to use a slightly convaluted process to get all the data, but that’s just so I can walk though a bunch of the rentrez functions. Let’s get started. Reece et al (2010, doi:10.1016/j.ympev.2010.07.013) presented a phylogeny of moray eels using four different genes, but didn’t publish the gene trees. I want to get the sequences underlying their analyses, which will be in the NCBI’s databases, so I can reproduce their results. To get data associated with this paper from the NCBI I need the PMID (pubmed ID), which I can find using the rentrez function entrez_search to query the pubmed database with the paper’s doi:

1
2
3
library(rentrez)
pubmed_search <- entrez_search(db = "pubmed", term = "10.1016/j.ympev.2010.07.013[doi]")
pubmed_search$ids
1
## [1] 20674752

All the functions in rentrez create a URL to get data from the NCBI, then fetch the resulting document, usually as an XML file. In most cases the functions will parse the most relevant sections of the XML file out and present them to you as items in a list (ids being one item of the pubmed_search list in this case).

OK, now we have the PMID, what data does NCBI have for this paper? The entrez_link function lets us find out. In this case the db argument can be used to limit the number of data sources to check, but I want to see every data source here so I’ll set this paramater to “all”:

1
2
NCBI_data <- entrez_link(dbfrom = "pubmed", id = pubmed_search$ids, db = "all")
str(NCBI_data)
1
2
3
4
5
6
7
8
9
10
11
## List of 10
##  $ pubmed_nuccore            : chr [1:119] "307082467" "307082465" "307082463" "307082461" ...
##  $ pubmed_popset             : chr [1:4] "307082412" "307075396" "307075338" "307075274"
##  $ pubmed_protein            : chr [1:118] "307082468" "307082466" "307082464" "307082462" ...
##  $ pubmed_pubmed             : chr [1:126] "20674752" "20375076" "19053846" "11430656" ...
##  $ pubmed_pubmed_combined    : chr [1:6] "20674752" "20375076" "19053846" "11430656" ...
##  $ pubmed_pubmed_five        : chr [1:6] "20674752" "20375076" "19053846" "11430656" ...
##  $ pubmed_pubmed_reviews     : chr "20674752"
##  $ pubmed_pubmed_reviews_five: chr "20674752"
##  $ pubmed_taxonomy_entrez    : chr [1:40] "876649" "876647" "876643" "876642" ...
##  $ file                      :Classes 'XMLInternalDocument', 'XMLAbstractDocument', 'oldClass' <externalptr> 

The most relevant data here is the from the popset database, which containts population and phylogenetic datasets. If I want to see what each of the four popset datasets associated with this paper are about I can use entrez_summary to have a look. This function can collect summaries from a lot of different databases, and, because the XML return by those databases isn’t conisitant doesn’t make any attempt to parse information from the resulting file. Instead you get a XMLInternalDocument object from the XML library, which you have to further process yourself. In this case, a little xpath gets the name of each dataset:

1
2
data_summaries <- entrez_summary(db = "popset", ids = NCBI_data$pubmed_popset)
xpathSApply(data_summaries, "//Item[@Name='Title']", xmlValue)
1
2
3
4
## [1] "Muraenidae cytochrome oxidase subunit 1 gene, partial cds; mitochondrial."
## [2] "Muraenidae recombination activating protein 2 gene, partial cds."         
## [3] "Muraenidae recombination activating protein 1 gene, partial cds."         
## [4] "Muraenidae cytochrome b gene, partial cds; mitochondrial."                

Ok, since we might expect nuclear and mitochondrial genes to hav different histories, let’s get sequences from each genome (the the COI and RAG1 datasets) using entrez_fetch. By specifying file_format="fasta" we will get characater vectors in the fasta format:

1
2
3
4
coi <- entrez_fetch(db = "popset", ids = NCBI_data$pubmed_popset[1], file_format = "fasta")
rag1 <- entrez_fetch(db = "popset", ids = NCBI_data$pubmed_popset[3], file_format = "fasta")
write(coi, "moray_coi_raw.fasta")
write(rag1, "moray_rag1_raw.fasta")

So I’ve got the data on hand - that’s all the I need rentrez for, but I might as well align these sequences and make gene trees for each. I’ll just do a quick and diry neighbor-joining tree using ape and we can clean up the long OTU names with the help of stingr. (I put the fussy work of cleaning the names and rooting the trees into a function clean_and_root):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
library(ape)
library(stringr)
clean_and_root <- function(tr, outgroup, resolved = TRUE) {
    tr$tip.label <- sapply(str_split(tr$tip.label, " "), function(x) paste(x[2:3],
        collapse = "_"))
    return(root(tr, outgroup, resolve.root = resolved))
}
par(mfrow = c(1, 2))
coi_ali <- muscle(read.dna("moray_coi_raw.fasta", "fasta"))
coi_tr <- nj(dist.dna(coi_ali, "k81"))
clean_coi_tr <- clean_and_root(coi_tr, "Uropterygius_macrocephalus")
plot(clean_coi_tr, direction = "rightwards", cex = 0.5)
rag_ali <- muscle(read.dna("moray_rag1_raw.fasta", "fasta"))
rag_tr <- nj(dist.dna(rag_ali, "k81"))
clean_rag_tr <- clean_and_root(rag_tr, "Uropterygius_macrocephalus")
plot(clean_rag_tr, direction = "leftward", cex = 0.5)

plot of chunk trees