In the previous workshops you have done the following stages using Illumina sequence data:
In this session you will
Use the variants (SNPs and indels) contained in a VCF file to carry out some population genomic analyses to:
The data you will be analysing are from clinical isolates of Leishmania donovani from Ethiopia. Normally you would have to first carry out the usual initial steps of cleaning sequences, aligning to the reference genome, SAM/BAM file manipulation, and variant calling. To save time, we have done these steps for you, and you will be working on the resulting VCF file.
Documenting your work is important.
Keep recording your commands in a plain text editor such as Notepad, and keep backing this up.
As part of your project, you will be required to hand in your notes. So get into the habit now
Log on to your designated server as you did in Workshops 2 and 3.
If your userid is in the alphabetic range ag1314 to ja1082, you can run the analysis for the workshops (and your projects) on the server called biollogin.york.ac.uk
If your username is in the alphabetic range jdh562 to xc1051, you will have you run your analysis for the workshops (and your projects) on a different server called biolnode0.york.ac.uk. To do this, first login in to biollogin, and then connect to biolnode0:
If you have forgotten how to do this, look back at Workshop 2 and Workshop 3.
Once you are logged into your designated server, use the cd command to navigate to the analysis directory you made last time.
cd /scratch/userid/something
Load the modules (i.e. the programs that you will use)
source /opt/yarcc/data/biology/mbiol_sequence_analysis_2019/biolprograms
As in the previous workshop, open a second connection to your designated server in which you can run top to keep an eye on tasks you are running.
If you have forgotten how to do this, look back at Workshop 2 and Workshop 3.
It is important that you check whether any job you have started has completed before trying to run it again (otherwise you may end up running multiple instances of the same command, each of which is trying to write to the same output files). To do this you should:
1. Check your window running top to see whether or not your job is still running
2. See if any output files have been created and whether they contain anything; ls -lt will list files in chronological order with the newest at the top and show the file size.
3. Check the contents of any log files that may be generated.
The VCF file you will need to complete today’s workshop is here: /opt/yarcc/data/biology/mbiol_sequence_analysis_2019/L_donovani_all_samples.vcf.gz
Copy this to your current directory (/scratch/userid/something)
Note that this is a single gz file. To uncompress this do:
gunzip L_donovani_all_samples.vcf.gz
(In previous workshops you have had to copy over and uncompress a tar.gz file, which is a collection of compressed files.)
Your directory should now contain a VCF file called L_donovani_all_samples.vcf containing variants (SNP and indels) genotyped across multiple Leishmania strains.
Have a look at the contents of the VCF file you have copied over.
Find out how many strains have been sequenced.
How many variants have been called?
The SNP and indel calling and Filtering SNP and indel calls sections from Workshop 3 will help you.
Optional:
Inferring ancestry differences among individuals from different populations, or identifying population structure has been motivated by many applications:
Population genetics
Genetic association studies
Prediction of whether an individual will have particular traits (for instance in people that have the duffy antigen that allows them to have resistance to malaria)
Until recently we were only able to group individuals into populations using a few genetic markers. However, the decrease in cost and improvements in technology mean that is now feasible to use whole genomes to perform these kinds of analyses. This means that we can now have very fine scaled analyses of populations based on many thousands of markers.
Principal components analysis (PCA) is the most commonly recognised and used method for identifying population structure.
This analysis allows us to calculate principal components (PCs) that explain the differences between individuals in genetic data. The naming of each of these components is based on the percentage of variation in the data they explain, with principal component 1 (PC1) being the most explanatory PC of the data. For instance if PC1 explains 61%, this means that all of the other PCs account for 39% of the variance observed in the data. PC2 will have the next largest contribution to the genetic variance, for example PC2 may contribute 20%. The overall variance should add up to 100%, and so if PC1 and PC2 explain 81% of the variance, the other PCs will explain the remaining 19%.
You will be using plink to investigate population structure of Ethiopian Leishmania isolates using principal components analysis, and then using R to plot the results to visualise any structure that may be present.
The first thing we need to do is made a BED file and the associated plink format files using the following command in plink:
plink --vcf L_donovani_all_samples.vcf --make-bed --chr-set 36 no-xy --out L_donovani_all_samples
The –make-bed command means that the vcf file information will be used to create a bed file and plink specific files it needs to perform other functions. Plink was based originally on human data, and so it expects a X/Y chromosome and 23 chromosomes. Leishmania has has 36 chromosomes, and no sex chromosomes, which we specify with the –chr-set and no-xy options respectively. The –out option specifes the prefix of the output files.
You should now have files with a suffix bed, fam, bim, log and nosex.
Using these plink files, we can now carry out the PCA:
plink --bfile L_donovani_all_samples --pca --chr-set 36 no-xy --out L_donovani_all_samples
Here the –bfile option means use bed, fam and bim files with the prefix L_donovani_all_samples. The chr-set and no-xy are also needed here. The –pca part of the command means generate eigenvalues
You will notice that previous plink command will have generated two new output files; ending with eigenvec and eigenval. Have a look at the contents of these files.
The eigenval file tells you the order of each PC ( so PC1, PC2….) and the percentage of variance explained by each PC.
The eigenvec file contains the coordinates for each sample along each PC. This file has no headers, is space separated, and contains the sample name in columns one and two, and then subsequently a column of co-ordinates for each PC.
The contents of the eigenvec file can be used to generate a plot in R
You will need to transfer the eigenvec file that you have created from the Linux server to your local computer using WinScp. To remember how to do this, look at workshop 2
This section will be using R studio.
First, open R Studio and start a new R script.
Then, check what the current working directory is by typing the following command in the console window.
getwd()
This will give us the path that R will currently be loading its data from. If this is not the folder that your eigenvec file is in then you you can change the folder location:
setwd("/Users/Sarah/Downloads")
By default, the eigenvec file that we have is seperated by spaces. You need to tell R this, so that it can recognise that there are multiple columns in the file. You can specify this with sep=" " when we read the file in. In the following, we will be saving the information from the file into an object called pca1 using the “<-” symbol. (Of course, like all object names in R, you can call it anything you like. You could call it chocolate_fish, but that wouldn’t help you to remember what the data object contains).
pca1 <- read.table("L_donovani_all_samples.eigenvec",sep=" ",header=F)
There are several important things to note here. The file name needs to be enclosed within double quotes, the sep command specifies how columns are defined, and header=F means that the first line of the file is not column names (otherwise the first sample and its eigenvalues will be used as column names).
We can then plot by the first and second principal components using the plot command. Because we have no headers in this file, R arbitrarily gives these V numbers, so the first two columns are called V1 and V2. These contain the sample names, V3 corresponds to PC1, V4 to PC2 and so on. We can plot the first and second PCs using the following command
plot(data=pca1, V3~V4)
Now try plotting some of the other PCs.
Questions to think about
How many genetic clusters are present across the sequenced Leishmania strains? A cluster can be defined as a group of strains that are genetically very similar to each other. Each cluster is genetically distinct from other clusters.
What is the relationship among these genetic clusters? i.e. which cluster is the most genetically distinct? Which genetic clusters are most similar to one another?
How many PCs is it worth plotting?
Before proceeding to the next session, see if you can identify which strains belong to each of the genetic clusters you have identified. There are a variety of ways to do this. For example, you could open the eigenvec file in Excel and sort the columns by PC. You will need this cluster information in the next section.
We will next use the VCF file to examine the genomic diversity of the population. Remember that each row of a VCF file describes one genetic variant in the population, and the columns describe the allele calls in each individual.
See the Wikipedia VCF page here, if you need to remind yourself about VCF files.
Analysis of genetic diversity allows us to describe populations, selection and some aspects of epidemiology. We often use the number of SNPs and INDELs as a way of determining how diverse individuals are from each other.
However there are more robust measures of genetic diversity that allow us to compare how varied individuals are, and to compare these values to other samples directly, and takes into account differences in the number of samples, or genome size.
Nucleotide diversity (often referred to using the symbol π) is the average pairwise difference between all possible pairs of individuals in your sample. It is a very intuitive and simple measure of genetic diversity, and is accurately estimated even with very few samples. A formal definition is here.
There is usually significant variation in nucleotide diversity along a genome.
Nucleotide diversity could be estimated at each variant along a genome, however such an estimate would be unreliable. A more accurate estimate of nucleotide diversity along a genome can be obtained by calculating nucleotide diversity along a section of the genome (i.e. averaging across several variants).
We can obtain the nucleotide diversity (π) from our VCF file using vcftools. In our case we will estimate the π values from 10 kb (10,000 bp) windows across the genome.
NB: vcftools is a very flexible package for analysing and manipulating VCF files. It can do many other wonderful things. The vcftools manual is on github here.
To calculate π over 10kb (10000bp) windows of the genome, do this:
vcftools --vcf L_donovani_all_samples.vcf --window-pi 10000 --out L_donovani_all_samples_10kb
Have a look at the tab separated file called L_donovani_all_samples_10kb.windowed.pi. You will see that estimates of π will have been made every 10,000 bases across the genome. There is also information about the genomic position of each window, as well as the number of variants used to make each π estimate.
Note how we have specified the output prefix to include useful information such as the window size, and the samples used.
Later, you will load this output file into R, and take a look at the variation in π along chromosomes.
We can also calculate statistics which allow us to see whether a population is neutrally evolving or whether it is under selective pressures. One of the measures we can use to test for this is Tajima’s D.
Remember that Tajima’s D will change depending on population size changes, and selection.
We can use vcftools again to calculate Tajima’s D for us, in 10kb windows accross the genome, like so:
vcftools --vcf L_donovani_all_samples.vcf --TajimaD 10000 --out L_donovani_all_samples_10kb
Have a look at the L_donovani_all_samples_10kb.Tajima.D output file produced, and remind yourself what negative and positive values of Tajima’s D represent. You will be plotting the variation in Tajima’s D along chromosomes using R.
Another important measure used in population genetics is Fst. The fixation index (Fst) is a measure of population separation. Fst ranges from 0 to 1. If Fst = 0, there is no genetic distance between populations. If Fst = 1, the two populations are very strongly separated.
You can use Fst to identify the relationship between individuals, and it can be used to identify populations. In vcftools you can make pairwise comparisons between populations or even between individuals. However, you have already examined population structure using PCA.
But genetic divergence between populations may also vary along a genome. For example, two populations may be genetically very similar across much of their genomes except at a few strongly divergent regions. These divergent regions may therefore contain adaptive genes enabling the two populations to adapt to different environments. So instead it would be interesting to examine if genetic divergence between populations varies along the genome.
To do this, you first need to define the members of each of your populations. You did this at the end of the PCA section where you identified cluster membership.
For each of your clusters/populations, create a file (named something sensible like population1, population2 etc) containing a list of cluster members like this (you can use nano do to this):
ERR304762
ERR304764
ERR304761
ERR304763
If you have not been able to define your populations based on the initial PCA analysis, you can download population files from /opt/yarcc/data/biology/mbiol_sequence_analysis_2019/Leish_populations.tar.gz
To calculate variation in genetic divergence (Fst) between the members of population1 and population2 in 10kb windows across the genome you need to do this:
vcftools --vcf L_donovani_all_samples.vcf --weir-fst-pop population1 --weir-fst-pop population2 --fst-window-size 10000 --out pop1_vs_pop2_FST_10kb
Have a look at the output file pop1_vs_pop2_FST_10kb.windowed.weir.fst
Now you are ready to load some data into R, and plot your results.
You will first need to copy over your pi, Tajima’s D and Fst output files from the server to your local computer using WinScp.
L_donovani_all_samples_10kb.windowed.pi
L_donovani_all_samples_10kb.Tajima.D
pop1_vs_pop2_FST_10kb.windowed.weir.fst
To read the data in the file into an object called pi.all, type this command in the console:
pi.all <- read.table("L_donovani_all_samples_10kb.windowed.pi",header=T)
We can then make a histogram of π, like so:
hist(pi.all$PI,br=20)
This isn’t very informative, because most sites have relatively low diversity. So let’s make a box and whiskers plot, like so:
boxplot(pi.all$PI,ylab="diversity")
Now let’s look at π along chromosome 12. To do this, lets first make a subset of the pi.all object:
pi.chr12 <- subset(pi.all, CHROM == "chr12")
This command means, take the subset of lines in the pi.all object, where the CHROM column is equal to chr12.
Try using using the commands below to see what you have done.
#find out how many rows your objects have
nrow(pi.chr12)
nrow(pi.all)
#see thr first and last lines
head(pi.chr12)
head(pi.all)
tail(pi.chr12)
tail(pi.all)
#see a summary of the data frame
summary(pi.chr12)
summary(pi.all)
Now, let’s make the plot.
Let’s plot the π value (on the y axis) against the position on the chromosome along the x axis. The position is in the column of the pi.chr12 data frame called BIN_START. We plot like so:
plot(pi.chr12$BIN_START,pi.chr12$PI,xlab="position",ylab="diversity")
Plot π for another chromosome.
Read the data in the file into an object called taj.all:
taj.all <- read.table("L_donovani_all_samples_10kb.Tajima.D",header=T)
You can then make a histogram of Tajima’s D:
hist(taj.all$TajimaD,br=20)
How does distribution compare to the expected distribution of Tajima’s D?
Now plot Tajima’s D along a chromosome 36. To do this, first make a subset of the taj.all object like you did before for π.
Then you can plot the Tajima’s D value (on thr y axis) against the position on the chromosome along the x axis:
plot(taj.chr36$BIN_START,taj.chr36$TajimaD,xlab="position",ylab="Tajima's D")
Plot Tajima’s D for another chromosome.
By re-using these commands you can now plot the Fst distribution and then plot Fst along a chromosome.
Questions and additional analyses
What influences the choice of window size in these analyses?
What is the issue with using too narrow a window size?
What is the issue with using too wide a window size?
Investigate what the results look like when you use smaller and larger windows.
Compare the variation in Fst across the genome using population pairs that are strongly and weakly divergent (based on the inital PCA)
You should now know how to use the variants contained in a VCF file to:
carry out a PCA to examine population structure using plink.
estimate the populations genomic parameters Fst, pi and Tajima’s D using vcftools.