ngsPool

ngsPool implements a method to estimate allele frequencies and perform various analyses from pooled sequencing data using ngsJulia. The model is described in the related paper. In addition to provide several estimators of allele frequencies, ngsPool includes scripts to estimate the site frequency spectrum and for association tests.

To showcase its use, this tutorial will simulate some pooled sequencing data and demonstrate the various options and possible analyses implemented in ngsPool.

Throughout these examples, we assume that we defined an environment variable NGSJULIA that points to the installation path. Assuming that ngsJulia is installed in a folder name ~/Software, this can be achieved temporarily wit h

NGSJULIA=~/Software/ngsJulia

or permanently with

sudo nano ~/.bashrc
export NGSJULIA=~/Software/ngsJulia
source ~/.bashrc

## Simulate pooled NGS data 

We can simulate NGS data from a pooled sequencing experiment using a script provided in `ngsJulia`.
We can explore its options:
```bash
Rscript $NGSJULIA/simulMpileup.R --help

which are also accessible on this page.

Let's assume that we wish to simulate 10 diploid genomes, 1000 base pairs each with an average sequencing depth of 20 and base quality of 20 in Phred score. Samples come from a population of 10,000 effective size under constant-size evolution. We can do that by running:

Rscript $NGSJULIA/simulMpileup.R --out test.txt --copy 2x10 --sites 1000 --depth 20 --qual 20 --ksfs 1 --ne 10000 --pool | gzip > test.mpileup.gz

and explore the files generated with:

ls test.*

The file test.txt contains the true genotypes while test.mpileup.gz is a gzipped mpileup file containing information on sequencing data. Specifically, columns in text.txt contain:
contig identifier (set to the value of --copy)
-position
-reference allele (set to A)
-alternate allele (set to C)
-population allele frequency
-genotypes
-sample allele frequency

Estimate allele frequencies with SNP calling and unknown sample size

Let's explore of ngsPool can be used to estimate allele frequencies. We can retrieve a list of all available options by typing:

julia $NGSJULIA/ngsPool/ngsPool.jl --help

The package requires a gzipped mpileup as input and the name of the output file in plain text format. Several options for data filtering are available. Let's understand its usage with several examples.

In many cases, the sample size is unknown. ngsPool provides a possibiity to obtain a maximum likelihood estimation (MLE) of the minor allele frequency under these circumstances. Using the simulated data set, we can obtain per-site MLE of allele frequencies with:

julia $NGSJULIA/ngsPool/ngsPool.jl --fin test.mpileup.gz --fout test.out.gz --lrtSnp 6.64

As an additional parameter, we specified a threshold for a Likelihood Ratio Test (LRT) for SNP calling. In this example, the choice of 6.64 corresponds to a p-value of 0.01 (3.84 and 10.83 would correspond to p-values of 0.05 and 0.001, respectively).

The ouput file can be visualised with:

zcat test.out.gz | less -S

and for each called SNP provides the following information:
-chromosome -position
-reference allele
-nonreference allele
-major allele (inferred)
-minor allele (inferred)
-lrtSNP (LRT statistic for SNP calling)
-lrtBia (LRT statistic for bialleic site calling)
-lrtTria ((LRT statistic for trialleic site calling)
-maf (estimated minor allele frequency)

The remaining columns are disabled using these options.

Estimate allele frequency without SNP calling and known sample size

With known sample size, ngsPool calculates per-site sample allele frequency likelihoods which can be used to provide estimators of allele frequency or for further downstream analyses. To this aim, the option --nChroms should be set equal to the product between ploidy and number of analysed samples. Additionally if the option --fsaf is set, a gzipped text file with sample allele frequency likelihoods is returned.

Following our previous example, we can estimate allele frequencies (and their likelihoods) with:

julia $NGSJULIA/ngsPool/ngsPool.jl --fin test.mpileup.gz --fout test.out.gz --nChroms 20 --fsaf test.saf.gz

Note that we are not doing SNP calling as --lrtSnp is not set.

The output file reports values for all sites that passed filtering:

zcat test.out.gz | less -S

and contains two additiona columns:
-saf_MLE (MLE of allele frequency from sample allele frequency likelihoods)
-saf_E (expected value of allele frequency from sample allele frequency likelihoods and uniform prior probability)

Additionally, a new file is generated:

zcat test.saf.gz | less -S

reporting the sample allele frequency log-likelihoods at each site (scaled to the maximum likelihood value). Each column corresponds to the log-likelihood value of that particular allele frequency, ranging from 0 to the maximum number of chromosomal copies (20, in this example).

Site frequency spectrum

The file containing the sample allele frequency log-likelihoods can be exploited for further downstream analyses. For instance, ngsPool provides a script to estimate the SFS with three different methods. This can be achieved with:

Rscript $NGSJULIA/ngsPool/poolSFS.R test.saf.gz > sfs.txt

The output file is accessible with

cat sfs.txt

and reports the estimated SFS based on:
-count: counting over MLE of per-site allele frequencies
-fit_count: fitting an exponential curve with counts of MLE of per-site allele frequencies
-fit_saf: fitting an exponential curve with per-site sample allele frequency likelihoods

Association test

ngsPool provides a script to calculate association tests from sample allele frequency likelihoods. Let's assume we have one target SNP and two groups, cases and controls, and we wish to test for a significant difference in allele frequencies. We simulate different allele frequencis in two groups from low-depth pooled NGS data with

# cases
Rscript $NGSJULIA/simulMpileup_qq.R --out /dev/null --copy 2x200 --sites 1 --depth 1 --qq 0.1 --pool | gzip > test.cases.mpileup.gz

# controls
Rscript $NGSJULIA/simulMpileup_qq.R --out /dev/null --copy 2x200 --sites 1 --depth 1 --qq 0.05 --pool | gzip > test.controls.mpileup.gz

The script simulMpileup_qq.R is an extention of simulMpileup.R to allow for fixed population allele frequencies to be simulated using the option --qq.

We calculate sample allele frequency likelihoods with:

# cases
julia $NGSJULIA/ngsPool/ngsPool.jl --fin test.cases.mpileup.gz --fout /dev/null --nChroms 300 --fsaf test.cases.saf.gz 2> /dev/null

# controls
julia $NGSJULIA/ngsPool/ngsPool.jl --fin test.controls.mpileup.gz --fout /dev/null --nChroms 300 --fsaf test.controls.saf.gz 2> /dev/null

These files are then used to test for association:

Rscript $NGSJULIA/ngsPool/poolAssoc.R test.cases.saf.gz test.controls.saf.gz > assoc.txt

and the resulting file is accessible with

cat assoc.txt

and shows LRT statistic and p-value (in log scale). With multiple SNPs, each test result will be shown on different lines.

Further options

All options available in ngsPool can be accessed with:

julia ngsPool/ngsPool.jl --help

usage: ngsPool.jl --fin FIN --fout FOUT [--fsaf FSAF]
                  [--nChroms NCHROMS] [--lrtSnp LRTSNP]
                  [--lrtBia LRTBIA] [--lrtTria LRTTRIA] [--minQ MINQ]
                  [--minDepth MINDEPTH] [--maxDepth MAXDEPTH]
                  [--nGrids NGRIDS] [--tol TOL]
                  [--phredscale PHREDSCALE] [--verbose VERBOSE]
                  [--printSites PRINTSITES] [-h]

optional arguments:
  --fin FIN             input file gzipped mpileup
  --fout FOUT           output file gzipped text
  --fsaf FSAF           output gzipped saf file (default: "/dev/null")
  --nChroms NCHROMS     total number of chromosomes pooled (ploidy *
                        number of individuals) [>0 ensables saf
                        likelihoods] (type: Int64, default: 0)
  --lrtSnp LRTSNP       chisquare value for SNP calling (type: Float64, default:
                        -Inf)
  --lrtBia LRTBIA       chisquare value for biallelic calling (type: Float64,
                        default: -Inf)
  --lrtTria LRTTRIA     chisquare value for triallelic (non) calling (type:
                        Float64, default: Inf)
  --minQ MINQ           minimum base quality in phredscore (type:
                        Int64, default: 5)
  --minDepth MINDEPTH   minimum global depth (type: Int64, default: 1)
  --maxDepth MAXDEPTH   maximum global depth (type: Int64, default:
                        100000)
  --nGrids NGRIDS       grid density for grid-search estimation of
                        allele frequencies (type: Int64, default: 0)
  --tol TOL             tolerance for GSS estimation of allele
                        frequencies (type: Float64, default: 1.0e-5)
  --phredscale PHREDSCALE
                        phredscale (type: Int64, default: 33)
  --verbose VERBOSE     verbosity level (type: Int64, default: 1)
  --printSites PRINTSITES
                        print on stdout every --printSites sites
                        (type: Int64, default: 10000)
  -h, --help            show this help message and exit