ngsPloidy

nsgPloidy implements a method to infer ploidy level from short-read NGS data using ngsJulia. The model is described in the related paper. In a nutshell, the algorithm calculates the likelihood of ploidy levels from genotype likelihoods and genotype priors. The latter can be estimated either from the data or, with limited sample size, from a population genetic model.

ngsPloidy takes gzipped mpileup files as input and returns several metrics of per-sample ploidy assignment. It calculates the likelihood of NGS data given each tested ploidy, with genotype probabilities calculated from their likelihoods and expected (or estimated) allele frequencies. ngsPloidy outputs the most likely vector of marginal ploidies, as well as the likelihood of all samples having the same ploidy. It also tests for multiploidy within the sample, i.e. it provides statistical support for the ploidy being different among all tested samples. ngsPloidy supports data filtering based on quality and depth values.

Throughout these examples, we assume that we defined an environment variable NGSJULIA that points to the installation path. Assuming that ngsJulia is installed in a folder name ~/Software, this can be achieved temporarily with

NGSJULIA=~/Software/ngsJulia

or permanently with

sudo nano ~/.bashrc
export NGSJULIA=~/Software/ngsJulia
source ~/.bashrc

Case studies

Let's investigate several case studies to understand how ngsPloidy works. We can run

julia $NGSJULIA/ngsPloidy/ngsPloidy.jl --help

to see all options available.

Case A: 4 haploids, 4 diploids, 4 triploids, 4 tetraploids, 4 pentaploids

We can simulate sequencing data in mpileup format by specifying the ploidy of each individual and other parameters of the sequencing experiment and species. We can use the script provided (which requires getopt package):

Rscript $NGSJULIA/simulMpileup.R --help

We simulate 100 sites all polymorphic in the population with:

Rscript $NGSJULIA/simulMpileup.R --out test.A.txt --copy 1x4,2x4,3x4,4x4,5x4 --sites 100 --depth 20 | gzip > test.A.mpileup.gz

The true data is contained in this file:

zcat test.A.txt | less -S

where each column indicates:
-identifier of contig (named after the --copy option)
-position
-reference allele (set to A)
-alternate allele (set to C)
-population allele frequency
-genotype for each sample
-sampled allele frequency

The simulated observed sequencing data is accessible with:

zcat test.A.mpileup.gz | less -S

and it is formatted as standard mpileup file.

If we assume we have enough sample size, we can use the estimated allele frequency to calculate genotype probabilities at each site. In this case, we can simply infer ploidy levels with:

julia $NGSJULIA/ngsPloidy/ngsPloidy.jl --fin test.A.mpileup.gz --nSamples 20 > test.A.out

and access the output file as:

cat test.A.out

Results show:
-nr of analysed sites: vector of sites that passed filtering for each sample
-log-likelihoods of per-sample ploidies: a matrix of nr_sample X nr_ploidies with the log-likelihood of each sample having a certain ploidy (rows are separated by ;)
-MLE vector of ploidies: the vector of individual maximum likelihood estimates of ploidy for each sample
-log-likelikehood of MLE vector of ploidies: the log-likelihood of the above vector of estimated ploidies
-LRT of multiploidy: difference between MLE vector of ploidies and the log-likelihood of all samples having the same ploidy, calculated for all tested ploidy levels

Note that by default all ploidy levels from 1 to 8 are tested, and therefore for each sample 8 likelihoods are calculated and reported. To extract the interpretation of these results, we can run the following script:

Rscript $NGSJULIA/ngsPloidy/ploidyLRT.R --in test.A.out --out test.A.ploidy.out
cat test.A.ploidy.out

which outputs the most likely ploidy with its statistical support and the result for the test of multiploidy. In this example, we can see that the vector of estimated ploidy levels is equal to the simulated one. We further observe that the LRT value for multiploidy are high, further suggesting variation in ploidy among samples. In fact, the most likely vector of equal ploidy levels is 4 but it has less support than the inferred vector of multiploidy.

Case B: 2 haploids, 2 diploids, 2 triploids, 2 tetraploids, 2 pentaploids

This example is similar to Case A but with half of the samples. We simulate this scenario with:

Rscript $NGSJULIA/simulMpileup.R --copy 1x2,2x2,3x2,4x2,5x2 --sites 100 --depth 20 | gzip > test.B.mpileup.gz

With limited sample size we can use a different estimation of population allele frequency which will be constant across all sites. In case of limited sample size, firstly we need a create a file containing genotype probabilities. We also need to provide the probability of the major allele being ancestral, as this information will be used in case of limited sample size. These probability files can be generated using the following R script (which requires getopt package):

Rscript $NGSJULIA/ngsPloidy/writePars.R --help

Further documentation is available here.

If we assume that we know the ancestral state and it is equivalent to the reference allele, we can run:

Rscript $NGSJULIA/ngsPloidy/writePars.R -k 1 -n 10000 -p 1 > test.pars

julia $NGSJULIA/ngsPloidy/ngsPloidy.jl --fin test.B.mpileup.gz --fpars test.pars --nSamples 10 --keepRef 1 > test.B.out

Rscript $NGSJULIA/ngsPloidy/ploidyLRT.R --in test.B.out --out test.B.ploidy.out

The option --keepRef forces the reference allele to be one of the two considered alleles and it is mandatory with --fpars.

As an example,

cat test.pars

outputs in the first line the probability of major allele being ancestral and derived, while in the remaining lines it shows the probabilities for diallelic genotypes for each tested ploidy (up to ploidy 8).

If --fout is given (optional), ngsPloidy prints some statistics for each site, including the estimated allele frequency:

julia $NGSJULIA/ngsPloidy/ngsPloidy.jl --fin test.B.mpileup.gz --fpars test.pars --fout test.B.out.gz --nSamples 10 --keepRef 1

with the output file accessible with:

zcat test.B.out.gz | less -S

Specifically, this file reports:
-chrom: chromosome
-pos: position
-ref: reference allele
-depth: sequencing depth
-ref/anc: reference or ancestral allele (inferred)
-alt/der: alternate or derived allele (inferred)
-lrtSnp: LRT for the site being a SNP
-lrtBia: LRT for the site being biallelic
-lrtTria: LRT for the site being triallelic
-aaf: estimated alternate or ancestral allele frequency

Note that in this case results are printed on the screen. Also note that if --fpars is not provided, --fout will show the estimate of the minor allele frequency (maf) instead.

We can also change the genotype probabilities in input. For instance, we can impose an automatic setting of the probability of major allele being the ancestral state with:

Rscript $NGSJULIA/ngsPloidy/writePars.R -k 1 -n 10000 -p -1 > test.auto.pars

julia $NGSJULIA/ngsPloidy/ngsPloidy.jl --fin test.B.mpileup.gz --fpars test.auto.pars --fout test.B.out.gz --nSamples 10 --keepRef 1

One additional possibility is to not impose any genotype probability based on the site frequency spectrum (with --fpars) and use a uniform probability distribution instead (--unif 1). With this option you are required that there are is no missing data at each sample (i.e. --minSamples should be equal to --nSamples), otherwise the program will throw an error.

julia $NGSJULIA/ngsPloidy/ngsPloidy.jl --fin test.B.mpileup.gz --unif 1 --nSamples 10 --minSamples 10

Finally, we can even infer ploidy by assigning genotypes (--callGeno) and consider their likelihoods only in the ploidy estimation with:

julia $NGSJULIA/ngsPloidy/ngsPloidy.jl --fin test.B.mpileup.gz --unif 1 --callGeno 1 --nSamples 10 --minSamples 10

Please note that we obtained the same inferred vector of ploidy levels in all these different options. However, results may vary significantly between different settings depending on sequencing depth, sample size, and number of sites. For instance, calling genotypes is not recommended for low-depth data.

Case C: 10 triploids and no output file, simulating an error rate of assigning the ancestral state of 0.10

We can simulate this scenario with:

Rscript $NGSJULIA/simulMpileup.R --out test.C.txt --copy 3x10 --sites 100 --depth 20 --panc 0.1 | gzip > test.C.mpileup.gz

As there is uncertainty in the allelic polarisation (and limited sample size), we can incorporate such error in the genotype probability file previously calcolated. As a further illustration, we will filter out bases with a quality lower than 15 in Phred score.

Rscript $NGSJULIA/ngsPloidy/writePars.R -k 1 -n 10000 -p 0.90 > test.unk.pars

julia $NGSJULIA/ngsPloidy/ngsPloidy.jl  --fin test.C.mpileup.gz --fpars test.unk.pars --nSamples 10 --keepRef 1 --minQ 15 > test.C.out

Rscript $NGSJULIA/ngsPloidy/ploidyLRT.R test.C.out

As we can evince from the results, we fail to reject the null hypothesis of equal ploidy across all samples, as desired given the simulated scenario.

If -fglikes is given (optional), the program ouputs the per-site genotype likelihoods for each tested ploidy,

julia $NGSJULIA/ngsPloidy/ngsPloidy.jl --fin test.C.mpileup.gz --fpars test.unk.pars --fglikes test.C.glikes.gz --nSamples 10 --keepRef 1 --minQ 15

with the output file accessible with:

zcat test.C.glikes.gz | less -S

where genotype likelihoods (assuming diallelic variation) for all tested ploidy are provided on each line.

Case D: 1 diploid, 8 triploids, 1 tetraploid with and folded allele frequencies

This scenario can be simulated with:

Rscript $NGSJULIA/simulMpileup.R --out test.D.txt --copy 2x1,3x8,4x1 --sites 100 --depth 20 | gzip > test.D.mpileup.gz

As the data is folded (we have no information on which allele is ancestral or derived), we can use the appropriate genotype probability previously calculated:

Rscript $NGSJULIA/ngsPloidy/writePars.R -k 1 -n 10000 -p 0.5 > test.fold.pars

julia $NGSJULIA/ngsPloidy/ngsPloidy.jl --fin test.D.mpileup.gz --fpars test.fold.pars --nSamples 10 --keepRef 1 > test.D.out

Rscript $NGSJULIA/ngsPloidy/ploidyLRT.R test.D.out

From the results, multiploidy is statistically supported.

Case E: 1 tetraploid with Ne=1e6, experiencing population growth with SNP calling

In this case we have only one sample and we will attempt to infer its ploidy for called SNPs. Let's assume we have 1000 sites polymorphic in the population. This scenario can be simulated with:

Rscript $NGSJULIA/simulMpileup.R --copy 4x1 --sites 1000 --depth 30 --ksfs 0.90 --ne 100000 | gzip > test.E.mpileup.gz

Note that, as an illustration, we are not generating an output file for the real data.

As we wish to call SNPs, we need to indicate the appropriate file for genotype probabilities and a threshold for SNP calling (--lrtSnp). The former can be obtained with:

Rscript $NGSJULIA/ngsPloidy/writePars.R -k 0.9 -n 100000 -p 1 -s > test.snp.pars

The latter is in chi-square score value where, for instance, 6.64 corresponds to a p-value of 0.01.

julia $NGSJULIA/ngsPloidy/ngsPloidy.jl --fin test.E.mpileup.gz --fpars test.snp.pars --keepRef 1 --nSamples 1 --lrtSnp 6.64 > test.E.out

Rscript $NGSJULIA/ngsPloidy/ploidyLRT.R test.E.out

Please note that, when SNP calling in --fpars is used, haploid likelihoods cannot be calculated. Also note that in this case the number of analysed sites is less that the number of simulated sites because only SNPs are considered.

Further options

Several filtering options on base quality, depth and proportion of minor reads are available. Likewise, options to include only bialleic sites (i.e. exclude triallelic and multiallelic sites) can be used. Results may vary depending on the filtering options and users are encourage to consider how their inferences are robust to the data processing pipeline.

All options in ngsPloidy can be retrieved by:

julia $NGSJULIA/ngsPloidy/ngsPloidy.jl --help

usage: ngsPloidy.jl --fin FIN [--fpars FPARS] [--fout FOUT]
                    [--fglikes FGLIKES] --nSamples NSAMPLES
                    [--ploidy PLOIDY] [--keepRef KEEPREF]
                    [--callGeno CALLGENO] [--unif UNIF]
                    [--lrtSnp LRTSNP] [--lrtBia LRTBIA]
                    [--lrtTria LRTTRIA] [--minMaf MINMAF]
                    [--minQ MINQ]
                    [--minNonMajorCount MINNONMAJORCOUNT]
                    [--minNonMajorProportion MINNONMAJORPROPORTION]
                    [--minGlobalDepth MINGLOBALDEPTH]
                    [--maxGlobalDepth MAXGLOBALDEPTH]
                    [--minSampleDepth MINSAMPLEDEPTH]
                    [--maxSampleDepth MAXSAMPLEDEPTH]
                    [--minSamples MINSAMPLES] [--nGrids NGRIDS]
                    [--tol TOL] [--phredscale PHREDSCALE]
                    [--verbose VERBOSE] [--debug DEBUG]
                    [--printSites PRINTSITES] [-h]



optional arguments:
  --fin FIN             input file gzipped mpileup
  --fpars FPARS         pars (pANC and geno priors) file (default:
                        "NULL")
  --fout FOUT           output file gzipped text (default: "NULL")
  --fglikes FGLIKES     genotype likelihoods file gzipped text
                        (default: "NULL")
  --nSamples NSAMPLES   number of samples (type: Int64)
  --ploidy PLOIDY       ploidies to be tested, e.g. 2-5 or 1,3-5
                        (default: "1-8")
  --keepRef KEEPREF     keep reference as one possible allele (type:
                        Int64, default: 0)
  --callGeno CALLGENO   call genotypes (highest likelihood/posterior
                        probability) (type: Int64, default: 0)
  --unif UNIF           use a uniform prior for genotype probabilities
                        (type: Int64, default: 0)
  --lrtSnp LRTSNP       chisquare for SNP calling (type: Float64,
                        default: -Inf)
  --lrtBia LRTBIA       chisquare for biallelic calling (type:
                        Float64, default: -Inf)
  --lrtTria LRTTRIA     chisquare for triallelic (non) calling (type:
                        Float64, default: Inf)
  --minMaf MINMAF       minimum allele frequency for SNP calling
                        (type: Float64, default: -Inf)
  --minQ MINQ           minimum base quality in phredscore (type:
                        Int64, default: 13)
  --minNonMajorCount MINNONMAJORCOUNT
                        minimum non major base count (type: Int64,
                        default: 0)
  --minNonMajorProportion MINNONMAJORPROPORTION
                        minimum non major base proportion (type:
                        Float64, default: 0.0)
  --minGlobalDepth MINGLOBALDEPTH
                        minimum global depth (type: Int64, default: 1)
  --maxGlobalDepth MAXGLOBALDEPTH
                        maximum global depth (type: Int64, default:
                        10000)
  --minSampleDepth MINSAMPLEDEPTH
                        minimum sample depth (type: Int64, default: 0)
  --maxSampleDepth MAXSAMPLEDEPTH
                        maximum sample depth (type: Int64, default:
                        10000)
  --minSamples MINSAMPLES
                        minimum number of valid samples to retain site
                        (type: Int64, default: 1)
  --nGrids NGRIDS       grid density for grid-search estimation of
                        allele frequencies (type: Int64, default: 0)
  --tol TOL             tolerance for GSS estimation of allele
                        frequencies (type: Float64, default: 1.0e-5)
  --phredscale PHREDSCALE
                        phredscale (type: Int64, default: 33)
  --verbose VERBOSE     verbosity level (type: Int64, default: 0)
  --debug DEBUG         debug for 1 sample (type: Int64, default: 0)
  --printSites PRINTSITES
                        print on stdout every --printSites sites
                        (type: Int64, default: 10000)
  -h, --help            show this help message and exit