NAME : IEIRI Yuki
Affiliated academic society : The Genetics Society of Japan
I am studying how to estimate the history of population differentiation, and I am currently developing a method to estimate it using hierarchical clustering and discriminant analysis.
<1>Distinguishing Two-Population and Three-Population Differentiation
One of my aims is to distinguish between the two models of population differentiation shown in Figure 1 below, given DNA sequences generated under those models.
Computer simulations with various parameter sets (e.g. N1, T, M12) show that the summary statistics calculated from DNA sequences (e.g. the number of segregating sites ST and the nucleotide diversity π) occupy both overlapping and non-overlapping regions for the two models (Figure 2).
Using hierarchical clustering, the summary-statistics data can be separated into clusters that contain data generated under only one model and clusters that contain data generated under both models (Figure 3a). In the latter case, the model under which each data set was generated can then be identified using Fisher's discriminant analysis (Figure 3b).
To investigate whether this method works well, I tested the clustering and the discriminant analysis using newly simulated DNA data.
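As an illustration of the two summary statistics named above, the following is a minimal sketch of how they could be computed from aligned sequences. It assumes equal-length sequences with no gaps, and the function names and the toy alignment are hypothetical, not taken from the simulations described here.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns in which at least two sequences differ."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def nucleotide_diversity(seqs):
    """Mean number of pairwise differences per site (pi)."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total_diff = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diff / (len(pairs) * length)


# Toy alignment (hypothetical data, for illustration only).
sample = ["ACGTACGT", "ACGTACGA", "ACCTACGA", "ACGTACGT"]
n_seg = segregating_sites(sample)   # columns 3 and 8 vary
pi = nucleotide_diversity(sample)   # 7 differences over 6 pairs and 8 sites
```

In a simulation study, each simulated sample would be reduced to a vector of such statistics before clustering.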
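The discriminant step for a mixed cluster can be sketched as follows. This is a minimal two-variable version of Fisher's linear discriminant, assuming exactly two summary statistics per data set and a threshold placed midway between the projected class means; the function names and the toy points are hypothetical.

```python
def mean_vec(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def scatter(points, mean):
    """2x2 within-class scatter matrix for one class."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mean[0], p[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def fisher_discriminant(class_a, class_b):
    """Return direction w = Sw^-1 (mean_a - mean_b) and a midpoint threshold."""
    ma, mb = mean_vec(class_a), mean_vec(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [dot(inv[0], diff), dot(inv[1], diff)]
    threshold = 0.5 * (dot(w, ma) + dot(w, mb))
    return w, threshold


def classify(point, w, threshold, label_a="two", label_b="three"):
    """Assign label_a when the projection lies on class A's side."""
    return label_a if dot(w, point) > threshold else label_b


# Toy summary-statistic vectors inside one mixed cluster (hypothetical).
two_pop = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
three_pop = [(5.0, 6.0), (6.0, 5.0), (6.0, 7.0), (7.0, 6.0)]
w, threshold = fisher_discriminant(two_pop, three_pop)
pred_low = classify((2.5, 2.5), w, threshold)
pred_high = classify((6.5, 5.5), w, threshold)
```

The sketch raises an error when the pooled scatter matrix cannot be inverted; this is only one way a discriminant function can fail to exist and is not meant to explain the excluded clusters reported below.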
First, as a result of clustering, 203 of the 2000 data sets generated under the two-population model were classified into pure two-population clusters, meaning that they were correctly assigned to their model. However, 27 fell into pure three-population clusters, and 1770 went to clusters containing data from both models; these were used to test the discriminant analysis, except for 68 that lay in clusters where Fisher's discriminant analysis was impossible. Next, of the 2000 data sets generated under the three-population model, 194 went to pure three-population clusters, 14 fell into pure two-population clusters, and 1792 were classified into clusters containing data from both models. Of these 1792, 45 belonged to clusters in which Fisher's discriminant analysis was impossible (Figure 4).
The remaining 3449 data sets were then used for discriminant analysis, and the proportion of correct answers was summarized. Figure 5 shows the number and proportion of correct answers overall (4000 data sets), for clustering, and for discriminant analysis. Discriminant analysis selected the correct model for 2537 data sets, a rate of 73.6%. This is not a high percentage, but considering that for some data it is impossible to choose the correct model because the models and parameters are so similar, the result is not unsatisfactory. An important implication is that for about 26% of actual data it would be difficult to identify the appropriate model.
I intend to apply this method to discriminate between models using actual DNA sequence data. That is, I will determine which cluster the actual DNA data fall into and, within that cluster, which model the discriminant function selects. These results will show the efficacy of the method.
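The counts reported above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in the text; the variable names are hypothetical.

```python
# Test data generated under each model, by the kind of cluster they entered.
two_pop_model = {"pure_two": 203, "pure_three": 27, "mixed": 1770}
three_pop_model = {"pure_three": 194, "pure_two": 14, "mixed": 1792}

# Data in mixed clusters where no discriminant function was available.
no_discriminant = {"two": 68, "three": 45}

total_two = sum(two_pop_model.values())
total_three = sum(three_pop_model.values())

# Data actually passed to the discriminant analysis.
tested = (two_pop_model["mixed"] - no_discriminant["two"]) + (
    three_pop_model["mixed"] - no_discriminant["three"]
)

correct = 2537
rate = round(100 * correct / tested, 1)  # percentage of correct answers
```

Both model totals come to 2000, the number passed to the discriminant analysis comes to 3449, and the success rate rounds to 73.6%, in agreement with the text.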
<2>Distinguishing Panmictic, Two-Island, Three-Island, Admixture and Split Models
Here, I assume that DNA samples are obtained from one population, and aim to find out which of five demographic models best fits those samples (Figure 6).
DNA samples were obtained by simulation, and summary statistics were calculated from them. The five models have overlapping and non-overlapping regions (Figure 7). These differing distributions can be separated by hierarchical clustering, which yielded 250 clusters; some consist of data generated under a single model, while others consist of data generated under two or more models. For the latter clusters, logistic discriminant analysis was applied.
To check this method, the clusters and discriminant functions were tested with newly simulated DNA data.
First, I checked whether the new test data go to clusters consisting of data generated under the corresponding demographic model. 38% (1876 of 5000) went to the correct clusters. However, many data sets entered clusters containing data from two or more models and had to be checked by a discriminant function, and 20 of them fell in clusters that contain no data from the corresponding model (Table 1).
The discriminant functions are calculated by logistic regression (Figure 8). In Figure 8 below, the functions are constructed using two variables so that they can be visualized; in practice, all summary statistics are used.
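The clustering step and the distinction between single-model and mixed clusters can be sketched as follows. This is a naive agglomerative procedure with average linkage and Euclidean distance; both choices are assumptions, since the linkage and distance actually used are not stated here, and the points and labels are hypothetical.

```python
from math import dist


def agglomerate(points, n_clusters):
    """Naive average-linkage clustering; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Average Euclidean distance between members of the two clusters.
        total = sum(dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        a, b = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def cluster_composition(clusters, labels):
    """For each cluster, the sorted list of models that generated its members."""
    return [sorted({labels[i] for i in c}) for c in clusters]


# Toy summary-statistic vectors and the models that generated them.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
labels = ["panmictic", "panmictic", "split", "admixture",
          "two_islands", "two_islands"]

clusters = agglomerate(points, 3)
composition = cluster_composition(clusters, labels)
# Clusters with more than one model need a discriminant function.
mixed = [c for c, models in zip(clusters, composition) if len(models) > 1]
```

In this toy case two clusters are pure and one mixes the split and admixture models, so only that one would be passed on to discriminant analysis.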
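A discriminant function based on logistic regression can be sketched for the two-model case as follows. This is a minimal version fitted by plain gradient descent on two variables; the learning rate, the number of epochs, the function names and the toy data are all assumptions for illustration, and a cluster containing more than two models would need a multi-class extension.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log-loss."""
    w = [0.0] * len(xs[0])
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            for k, xk in enumerate(x):
                grad_w[k] += err * xk
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(x, w, b):
    """Return 1 when the discriminant function is positive, else 0."""
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


# Toy data: two summary statistics per data set, model coded as 0 or 1.
xs = [(0.2, 0.1), (0.4, 0.3), (0.3, 0.2), (1.5, 1.4), (1.7, 1.6), (1.6, 1.8)]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
train_pred = [predict(x, w, b) for x in xs]
```

The line where the discriminant function equals zero plays the role of the boundary drawn in a two-variable plot.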
The appropriate model was chosen for 55% (1686 of 3075) of the data sets that had to be checked by discriminant functions. The proportions of correct answers are good for the panmixia and admixture models, because these models differ somewhat from the others. In contrast, those for the two-population, three-population and split models are lower, at about 50%. This is because these models resemble one another, and under some parameters they are so similar that discrimination is impossible.
As a result, the correct model could be chosen for 3562 of the 5000 data sets (71%) through hierarchical clustering and logistic discriminant analysis. I intend to use this method to select a suitable model for actual organisms.
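The percentages quoted for the five-model test follow from the stated counts, as this short check shows. It uses only numbers given in the text, and the variable names are hypothetical.

```python
total = 5000                   # new simulated test data sets
correct_by_clustering = 1876   # entered a cluster of the correct single model
checked = 3075                 # had to be checked by discriminant functions
correct_by_discriminant = 1686

clustering_pct = round(100 * correct_by_clustering / total)
discriminant_pct = round(100 * correct_by_discriminant / checked)

# Overall success combines both routes to a correct answer.
overall_correct = correct_by_clustering + correct_by_discriminant
overall_pct = round(100 * overall_correct / total)
```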