--- title: "Starting secsse" author: "Thijs Janzen" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Starting secsse} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ## Secsse introduction secsse is an R package designed for multistate data sets under a concealed state and speciation ('hisse') framework. In this sense, it is parallel to the 'MuSSE' functionality implemented in 'diversitree', but it accounts for finding possible spurious relationships between traits and diversification rates ('false positives', Rabosky & Goldberg 2015) by testing against a 'hidden trait' (Beaulieu et al. 2013), which is responsible for more variation in diversification rates than the trait being investigated. ### Secsse input files Similar to the 'diversitree' (Fitzjohn et al. 2012) and 'hisse' (Beaulieu & O'Meara 2016) packages, secsse uses two input files: a rooted, ultrametric tree in nexus format (for conversion of other formats to nexus, we refer to the documentation in package 'ape') and a data file with two columns, the first containing taxa names and the second a numeric code for trait state with a header (usually 0, 1, 2, 3, etc., but notice that 'NA' is a valid code too, if you are not sure what trait state to assign to a taxon). Here, we will use a simple trait dataset with only values 0 and 1, indicating presence and absence of a trait. A comma-separated value file (.csv) generated in MsExcel works particularly well. The \*.csv file can be loaded into R using the read.csv() function. and should look like this: ```{r} library(secsse) data(traits) tail(traits) ``` This data set (here we see only the bottom lines of the data frame) has two character states labeled as 0 and 1. Ambiguity about trait state (you are not sure which trait state to assign a taxon too, or you have no data on trait state for a particular taxon), can be assigned using 'NA'. secsse handles 'NA' differently from a full trait state, in that it assigns probabilities to all trait states for a taxon demarcated with 'NA'. The second object we need is an ultrametric phylogenetic tree, that is rooted and has labelled tips. One can load it in R by using read.nexus(). In our example we load a prepared phylogeny named "phylo_vignette": ```{r} data("phylo_vignette") ``` For running secsse it is important that tree tip labels agree with taxon names in the data file, but also that these are in the same order. For this purpose, we run the following piece of code prior to any analysis: ```{r} sorted_traits <- sortingtraits(traits, phylo_vignette) ``` If there is a mismatch in the number of taxa between data and tree file, you will receive an error message. However, to then identify which taxa are causing issues and if they are in the tree or data file, you can use the name.check function in the 'geiger'(Harmon et al. 2008) package: ```{r} library(geiger) #pick out all elements that do not agree between tree and data mismat <- name.check(phylo_vignette, traits) #this will call all taxa that are in the tree, but not the data file #mismat$tree_not_data #and conversely, #mismat$data_not_tree ``` If you have taxa in your tree file that do not appear in your trait file, it is worth adding them with value `NA` for trait state. You can visualise the tip states using the package diversitree: ```{r plot_tree} if (requireNamespace("diversitree")) { for_plot <- data.frame(trait = traits$trait, row.names = phylo_vignette$tip.label) diversitree::trait.plot(phylo_vignette, dat = for_plot, cols = list("trait" = c("blue", "red")), type = "p") } ``` After you are done properly setting up your data, you can proceed to setting parameters and constraints. #### Note on assigning ambiguity to taxon trait states If the user wishes to assign a taxon to multiple trait states, because he/she is unsure which state best describes the taxon, he/she can use `NA`. `NA` is used when there is no information on possible state at all; for example when a state was not measured or a taxon is unavailable for inspection. `NA` means a taxon is equally likely to pertain to any state. In case the user does have some information, for example if a taxon can pertain to multiple states, or if there is uncertainty regarding state but one or multiple states can with certainty be excluded, secsse offers flexibility to handle ambiguity. In this case, the user only needs to supply a trait file, with at least four columns, one for the taxon name, and three for trait state. Below, we show an example of what the trait info should be like (the column with species' names has been removed). If a taxon may pertain to trait state 1 or 3, but not to 2, the three columns should have at least the values 1 and a 3, but never 2 (species in the third row). On the other hand, the species in the fifth row can pertain to all states: the first column would have a 1, the second a 2, the third a 3 (although if you only have this type of ambiguity, it is easier to assign `NA` and use a single-column data file). ```{r} # traits traits traits # [1,] 2 2 2 # [2,] 1 1 1 # [3,] 2 2 2 # [4,] 3 1 1 # [5,] 1 2 3 ``` ## Setting up an analysis To perform a Maximum Likelihood analysis, secsse makes use of the function `DDD::optimize()`, which in turn, typically, uses the subplex package to perform the Maximum Likelihood optimization. In such an analysis, we need to specify which parameters we want to optimize, which parameters to keep fix, and the initial values per parameter. We do so by providing the structure of the input parameters (e.g. in vector, matrix or list form), and within this structure we highlight values that stay at zero with a 0, and parameters to be inferred with indexes 1, 2, ... n. The optimizer will then use these indexes to fill in the associated parameters and perform the optimization. If this all seems a bit unclear, please continue reading and look at the fully set up parameterization for the maximum likelihood below to gain more insight. ### ETD In the ETD model, we assume that the examined trait affects diversification. In a secsse analysis we need to specify the structure of three distinct properties: the lambda list, the mu vector and the transition (Q) matrix. Each of these informs properties of the model of speciation, extinction and trait-shifts respectively. #### Lambda matrices Speciation in a secsse model is defined using a list of matrices, where each matrix highlights the state of the daughter species resulting from a speciation event. In our case, we have a trait with two states, and thus we will have to specify a list with two matrices, one for each state, where each matrix in turn will then specify the daughter states. We can do so by hand, but secsse includes functionality to do this in a more organized manner - this is especially useful if you have a trait with more than two states for instance. In this more organized manner, we can provide secsse with a matrix specifying the potential speciation results, and secsse will construct the lambda list accordingly: ```{r ETD_lambda} spec_matrix <- c() spec_matrix <- rbind(spec_matrix, c(0, 0, 0, 1)) spec_matrix <- rbind(spec_matrix, c(1, 1, 1, 2)) lambda_list <- secsse::create_lambda_list(state_names = c(0, 1), num_concealed_states = 2, transition_matrix = spec_matrix, model = "ETD") lambda_list ``` Let's see what the code has done. First, we create a `spec_matrix`, where the first column indicates the parent species (0 or 1) and the second and third column indicate the identities of the two daughter species. In this case, we choose for symmetric speciation without a change of trait, e.g. the daughters have the same trait as the parent. If you have evidence of perhaps asymmetric inheritance, you can specify this here. The fourth column indicates the associated rate indicator. In this case we choose two different speciation rates. We choose two concealed states, as it is good practice to have the same number of concealed states as observed states. The resulting `lambda_list` then contains four entries, one for each unique state (see the names of the entries in the list), that is, for each combination of observed and concealed states, where the concealed states are indicated with a capital letter. Looking at the first entry in the list, e.g. the result of a speciation event starting with a parent in state 0A, will result with rate 1 in two daughter species of state 0A as well. The way to read this, is by looking at the row and column identifiers of the entered rate. Similarly, for a speciation event starting in state 1A (`lambda_list[[2]]`), the two daughter species are 1A as well, but this time with rate 2, as we specified that species with trait 1 will have a different speciation rate. Note that here, rates 1 and 2 are ordered with the observed trait, we will later explore the CTD model, where the rates will be sorted according to the concealed state. #### Mu vector Having the speciation rates set, we can move on to extinction rates. Since we are using the ETD model, here we also expect the extinction rates to be different: ```{r ETD_mu} mu_vec <- secsse::create_mu_vector(state_names = c(0, 1), num_concealed_states = 2, model = "ETD", lambda_list = lambda_list) mu_vec ``` The function `create_mus_vector()` takes the same standard information we provided earlier, with as addition our previously made `lambda_list`. It uses the `lambda_list` to identify the rate indicators (in this case 1 and 2) that are already used and to thus pick new rates. We see that secsse has created a named vector with two extinction rates (3 and 4), which are associated with our observed traits 0 and 1. #### Transition matrix Lastly, we need to specify our transition matrix. Often, Q matrices can get quite large and complicated, the more states you are analyzing. We have devised a tool to more easily put together Q matrices. This tool starts from the so-called `shift_matrix`, the basic matrix in which we only find information on transitions between examined states. The information contained in this `shift_matrix` is then automatically mimicked for inclusion in the full matrix, to ensure that the same complexity in examined state transitions is also found in concealed states. Instead of specifying the entire `shift_matrix`, instead it suffices to only specify the non-zero transitions. In this case these are from state 0 to 1, and vice versa: ```{r ETD_Q} shift_matrix <- c() shift_matrix <- rbind(shift_matrix, c(0, 1, 5)) shift_matrix <- rbind(shift_matrix, c(1, 0, 6)) q_matrix <- secsse::create_q_matrix(state_names = c(0, 1), num_concealed_states = 2, shift_matrix = shift_matrix, diff.conceal = TRUE) q_matrix ``` Thus, we first specify a matrix containing the potential state transitions, here 0-\>1 and 1-\>0. Then, we use `create_q_matrix()` to create the q-matrix. By setting `diff.conceal` to `TRUE`, we ensure that the concealed states will get their own rates specified. Setting this to `FALSE` would set their rates equal to the observed rates (5 and 6). The way to read the transition matrix is column-row, e.g. starting at state 0A, with rate 5 the species will shift to state 1A and with rate 7 it will shift to state 0B. We intentionally ignore 'double' shifts, e.g. from 0A to 1B, where both the observed and the concealed trait shift at the same time. If you have good evidence to include such shifts in your model, you can modify the trans_matrix by hand of course. #### Maximum Likelihood We have now specified the required ingredients to perform Maximum Likelihood analyses. Prerequisites for performing Maximum Likelihood analyses with secsse are that we specify the ids of the rates we want optimized, and provide initial values. We can do so as follows: ```{r ETD_ML_init} idparsopt <- 1:8 # our maximum rate parameter was 8 idparsfix <- c(0) # we want to keep all zeros at zero initparsopt <- rep(0.1, 8) initparsfix <- c(0.0) # all zeros remain at zero. sampling_fraction <- c(1, 1) ``` Here, we specify that we want to optimize all parameters with rates 1, 2, ..., 8. We set these at initial values at 0.1 for all parameters. Here, we will only use one starting point, but in practice it is often advisable to explore multiple different initial values to avoid getting stuck in a local optimum and missing the global optimum. `idparsfix` and `initparsfix` indicate that all entries with a zero are to be kept at the value zero. Lastly, we set the sampling fraction to be c(1, 1), this indicates to secsse that we have sampled per trait all species with that trait in our dataset. Alternatively, if we know that perhaps some species with trait 0 are missing, we could specify that as c(0.8, 1.0). Thus, note that the sampling fraction does not add up to 1 across traits, but within traits. And now we can perform maximum likelihood: ```{r ETD_ML} idparslist <- list() idparslist[[1]] <- lambda_list idparslist[[2]] <- mu_vec idparslist[[3]] <- q_matrix answ <- secsse::cla_secsse_ml(phy = phylo_vignette, traits = traits$trait, num_concealed_states = 2, idparslist = idparslist, idparsopt = idparsopt, initparsopt = initparsopt, idparsfix = idparsfix, parsfix = initparsfix, sampling_fraction = sampling_fraction, verbose = FALSE) ``` We can now extract several pieces of information from the returned answer: ```{R ETD_res} ML_ETD <- answ$ML ETD_par <- secsse::extract_par_vals(idparslist, answ$MLpars) ML_ETD ETD_par spec_rates <- ETD_par[1:2] ext_rates <- ETD_par[3:4] Q_Examined <- ETD_par[5:6] Q_Concealed <- ETD_par[7:8] spec_rates ext_rates Q_Examined Q_Concealed ``` The function `extract_par_vals()` goes over the list `answ$MLpars` and places the found parameter values back in consecutive vector 1:8 in this case. Here, we find that the speciation rate of trait 1 is higher than the speciation rate of trait 0. ### CTD Let's compare our findings with a CTD model, e.g. a model centered around the concealed trait. Again, we need to specify our lambda list, mu vector and transition matrix. We will see that this is quite straightforward now that we have gotten the hang of how this works. #### Lambda matrices Again, we specify two distinct rates, indicating that the observed state inherits faithfully to the daughter species. However, this time, we set the model indicator to "CTD": ```{r CTD_lambda} spec_matrix <- c() spec_matrix <- rbind(spec_matrix, c(0, 0, 0, 1)) spec_matrix <- rbind(spec_matrix, c(1, 1, 1, 2)) lambda_list <- secsse::create_lambda_list(state_names = c(0, 1), num_concealed_states = 2, transition_matrix = spec_matrix, model = "CTD") lambda_list ``` The resulting `lambda_list` now has the chosen rates 1 and 2 sorted differently across the matrices, with matrices 1 and 2 containing rate 1, and matrices 3 and 4 containing rate 2. Looking at the column names of the matrices, states 1 and 2 are states 0A and 1A, and states 3 and 4 are states 0B and 1B, in other words, speciation rate 1 is now associated with all states with concealed state A, and speciation rate 2 is now associated with all states with concealed state B. #### Mu vector For the mu vector, we repeat the same we did for the ETD model: ```{r CTD_mu} mu_vec <- secsse::create_mu_vector(state_names = c(0, 1), num_concealed_states = 2, model = "CTD", lambda_list = lambda_list) mu_vec ``` Here, again, we see that whereas previously extinction rate 3 was associated with states 0A and 0B (e.g. all states with state 0), it is now associated with states 0A and 1A, e.g. all states associated with concealed state A. #### Transition matrix Setting up the transition matrix is not different from the ETD model, the same transitions are possible: ```{r CTD_Q} shift_matrix <- c() shift_matrix <- rbind(shift_matrix, c(0, 1, 5)) shift_matrix <- rbind(shift_matrix, c(1, 0, 6)) q_matrix <- secsse::create_q_matrix(state_names = c(0, 1), num_concealed_states = 2, shift_matrix = shift_matrix, diff.conceal = TRUE) q_matrix ``` #### Maximum Likelihood Now that we have specified our matrices, we can use the same code we used for the ETD model to perform our maximum likelihood: ```{r CTD_ML} idparsopt <- 1:8 # our maximum rate parameter was 8 idparsfix <- c(0) # we want to keep all zeros at zero initparsopt <- rep(0.1, 8) initparsfix <- c(0.0) # all zeros remain at zero. sampling_fraction <- c(1, 1) idparslist <- list() idparslist[[1]] <- lambda_list idparslist[[2]] <- mu_vec idparslist[[3]] <- q_matrix answ <- secsse::cla_secsse_ml(phy = phylo_vignette, traits = traits$trait, num_concealed_states = 2, idparslist = idparslist, idparsopt = idparsopt, initparsopt = initparsopt, idparsfix = idparsfix, parsfix = initparsfix, sampling_fraction = sampling_fraction, verbose = FALSE) ML_CTD <- answ$ML CTD_par <- secsse::extract_par_vals(idparslist, answ$MLpars) ML_CTD CTD_par spec_rates <- CTD_par[1:2] ext_rates <- CTD_par[3:4] Q_Examined <- CTD_par[5:6] Q_Concealed <- CTD_par[7:8] spec_rates ext_rates Q_Examined Q_Concealed ``` Here we now find that state A has a very low speciation rate, in contrast to a much higher speciation rate for state B (remember that speciation rate 1 is now associated with A, and not with state 0!). Similarly, extinction rates for both states are also quite different, with state A having a much lower extinction rate than state B. Examined trait shifts (`Q_Examined`) are quite low, whereas concealed trait shifts seem to be quite high. The LogLikelihood seems to be lower than what we found for the ETD model. ### CR As a check, we will also fit a model where there is no trait effect - perhaps we are looking for an effect when there is none. This is always a good sanity check. #### Lambda matrices To specify the lambda matrices, this time we choose the same rate indicator across both states. ```{r CR_lambda} spec_matrix <- c() spec_matrix <- rbind(spec_matrix, c(0, 0, 0, 1)) spec_matrix <- rbind(spec_matrix, c(1, 1, 1, 1)) lambda_list <- secsse::create_lambda_list(state_names = c(0, 1), num_concealed_states = 2, transition_matrix = spec_matrix, model = "CR") lambda_list ``` #### Mu vector The mu vector follows closely from this, having a shared extinction rate across all states: ```{r CR_mu} mu_vec <- secsse::create_mu_vector(state_names = c(0, 1), num_concealed_states = 2, model = "CR", lambda_list = lambda_list) mu_vec ``` #### Transition matrix We will use the same transition matrix as used before, although one could perhaps argue that without a trait effect, all rates in the transition matrix (both forward and reverse trait shifts) should share the same rate. Here, we will choose the more parameter-rich version (Home assignment: try to modify the code to perform an analysis in which all rates in the transition matrix are the same). ```{r CR_Q} shift_matrix <- c() shift_matrix <- rbind(shift_matrix, c(0, 1, 3)) shift_matrix <- rbind(shift_matrix, c(1, 0, 4)) q_matrix <- secsse::create_q_matrix(state_names = c(0, 1), num_concealed_states = 2, shift_matrix = shift_matrix, diff.conceal = TRUE) q_matrix ``` #### Maximum Likelihood ```{r CR_ML} idparsopt <- 1:6 # our maximum rate parameter was 6 idparsfix <- c(0) # we want to keep all zeros at zero initparsopt <- rep(0.1, 6) initparsfix <- c(0.0) # all zeros remain at zero. sampling_fraction <- c(1, 1) idparslist <- list() idparslist[[1]] <- lambda_list idparslist[[2]] <- mu_vec idparslist[[3]] <- q_matrix answ <- secsse::cla_secsse_ml(phy = phylo_vignette, traits = traits$trait, num_concealed_states = 2, idparslist = idparslist, idparsopt = idparsopt, initparsopt = initparsopt, idparsfix = idparsfix, parsfix = initparsfix, sampling_fraction = sampling_fraction, verbose = FALSE) ML_CR <- answ$ML CR_par <- secsse::extract_par_vals(idparslist, answ$MLpars) ML_CR CR_par spec_rate <- CR_par[1] ext_rate <- CR_par[2] Q_Examined <- CR_par[3:4] Q_Concealed <- CR_par[5:6] spec_rate ext_rate Q_Examined Q_Concealed ``` We now recover a non-zero extinction rate, and much higher transition rates for the concealed than for the observed states. ### Model comparisong using AIC Having collected the different log likelihoods, we can directly compare the models using AIC. Remembering that the AIC is 2k - 2LL, where k is the number of parameters of each model and LL is the Log Likelihood, we can calculate this as follows: ```{r AIC} res <- data.frame(ll = c(ML_ETD, ML_CTD, ML_CR), k = c(8, 8, 6), model = c("ETD", "CTD", "CR")) res$AIC <- 2 * res$k - 2 * res$ll res ``` I can now reveal to you that the tree we used was generated using an ETD model, which we have correctly recovered! ## Further help If after reading these vignettes, you still have questions, please feel free to create an issue at the package's GitHub repository https://github.com/rsetienne/secsse/issues or e-mail the authors for help with this R package. Additionally, bug reports and feature requests are welcome by the same means. ## References Beaulieu, J. M., O'Meara, B. C., & Donoghue, M. J. (2013). Identifying hidden rate changes in the evolution of a binary morphological character: the evolution of plant habit in campanulid angiosperms. Systematic biology, 62(5), 725-737. Beaulieu, J. M., & O'Meara, B. C. (2016). Detecting hidden diversification shifts in models of trait-dependent speciation and extinction. Systematic biology, 65(4), 583-601. FitzJohn, R. G. (2012). Diversitree: comparative phylogenetic analyses of diversification in R. Methods in Ecology and Evolution, 3(6), 1084-1092. Harmon, L. J., Weir, J. T., Brock, C. D., Glor, R. E., & Challenger, W. (2008). GEIGER: investigating evolutionary radiations. Bioinformatics, 24(1), 129-131. Rabosky, D. L., & Goldberg, E. E. (2015). Model inadequacy and mistaken inferences of trait-dependent speciation. Systematic Biology, 64(2), 340-355.