CLC Genomics Workbench

SNP detection

Instead of manually checking all the conflicts of a contig to discover significant single-nucleotide variations, CLC Genomics Workbench offers automated SNP detection.

The SNP detection in CLC Genomics Workbench is based on the Neighborhood Quality Standard (NQS) algorithm of [Altshuler et al., 2000] (also see [Brockman et al., 2008] for more information).

Based on your specification of what you consider a valid SNP, the SNP detection scans the entire contig and reports all SNPs that meet the requirements.

Screenshot 1

Assessing the quality of the neighborhood bases

The SNP detection algorithm looks at each position in the contig to determine whether there is a SNP at this position. To make a qualified assessment, it also considers the general quality of the neighboring bases. Therefore, a Window size is defined to determine how far from the current position this quality assessment should extend.

Screenshot 2: An example of a window size of 11 nucleotides.

For each read and within the given window size, the following two parameters are used to assess the quality:

  • Minimum average quality of surrounding bases: The average quality score of the nucleotides in a read within the specified window length has to exceed this threshold for the base to be included in the SNP calculation for this position.
  • Max. number of gaps and mismatches: The number of gaps and mismatches allowed within the window length of the read. If there are more gaps or mismatches, this read will not be included in the SNP calculation at this position. Unaligned regions (the faded parts of a read) also count as mismatches, even if some of the bases match.
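
The per-read check described above can be sketched in a few lines of Python. This is only an illustration of the two window parameters, not the Workbench's actual implementation: the function name, the argument names, the decision to shorten the window at the ends of a read, the inclusion of the central base in the average, and the use of "greater than or equal" for the quality threshold are all assumptions made for this example.

```python
def read_passes_window_filter(qualities, is_mismatch, pos,
                              window_size, min_avg_quality,
                              max_gaps_mismatches):
    """Return True if a read may contribute to the SNP call at `pos`.

    qualities    -- per-base quality scores of the read (list of numbers)
    is_mismatch  -- per-base flags, True for a gap, a mismatch or an
                    unaligned base (unaligned bases count as mismatches)
    pos          -- index within the read of the position being examined
    All names here are hypothetical and chosen for this sketch only.
    """
    if not 0 <= pos < len(qualities):
        raise ValueError("pos must lie within the read")
    if len(qualities) != len(is_mismatch):
        raise ValueError("qualities and is_mismatch must have equal length")

    # A window of 11 covers the position itself plus 5 bases on each side.
    half = window_size // 2
    start = max(0, pos - half)                 # assumption: clip at read ends
    end = min(len(qualities), pos + half + 1)

    window_qualities = qualities[start:end]
    average_quality = sum(window_qualities) / len(window_qualities)
    gaps_and_mismatches = sum(is_mismatch[start:end])

    return (average_quality >= min_avg_quality
            and gaps_and_mismatches <= max_gaps_mismatches)


# Illustrative values only; they are not taken from the Workbench defaults.
good_read = read_passes_window_filter([30] * 11, [False] * 11, pos=5,
                                      window_size=11, min_avg_quality=20,
                                      max_gaps_mismatches=2)
noisy_read = read_passes_window_filter([30] * 11,
                                       [True, True, True] + [False] * 8,
                                       pos=5, window_size=11,
                                       min_avg_quality=20,
                                       max_gaps_mismatches=2)
print(good_read, noisy_read)
```

Reads for which this check fails are simply left out at the position in question; they can still contribute at other positions where their window looks better.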

The following vertical SNP measurements, i.e. measurements taken across the valid reads covering a position, are used:

  • Minimum coverage: If SNPs were called in areas of low coverage, you would get a higher number of false positives. Therefore you can set the minimum coverage for a SNP to be called. Note that the coverage is counted as the number of valid reads at the current position (i.e. the reads remaining when the quality assessment has filtered out the bad ones).
  • Minimum variant frequency (%): If only one read has a variant base, you may not want this to count as a SNP. This threshold determines the minimum frequency for a variant to be called a SNP. By default, the value is set to 60%, which means that the variant base must be present in at least 60% of the valid reads at the position before a SNP is called. Note that if you have two different variants, each with e.g. 30% frequency, it will not be counted as a SNP. If you sequence diploid genomes, you may have to lower this value to detect all SNPs, since a heterozygous variant is expected in only about half of the reads.
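
As a rough illustration of how the two vertical thresholds interact, the following Python sketch applies them to the bases observed in the valid reads at one position. It is a simplified model under stated assumptions, not the Workbench's code: the function and parameter names are hypothetical, only the most frequent non-reference base is considered, and the threshold is treated as "at least".

```python
from collections import Counter


def call_snp(reference_base, valid_bases, min_coverage, min_variant_frequency):
    """Return the variant base if a SNP is called at this position, else None.

    valid_bases           -- bases from the reads that passed the quality
                             assessment at this position
    min_variant_frequency -- threshold in percent (e.g. 60)
    All names here are hypothetical and chosen for this sketch only.
    """
    # Coverage counts valid reads only, not all reads at the position.
    coverage = len(valid_bases)
    if coverage == 0 or coverage < min_coverage:
        return None

    # Count every base that differs from the reference.
    variants = Counter(base for base in valid_bases if base != reference_base)
    if not variants:
        return None

    # Each variant is judged on its own frequency; two variants at 30%
    # each do not add up to 60%.
    variant_base, count = variants.most_common(1)[0]

    # Compare as count/coverage >= threshold/100 without float rounding.
    if count * 100 >= min_variant_frequency * coverage:
        return variant_base
    return None


# 7 of 10 valid reads show G against reference A: 70%, above the 60% threshold.
print(call_snp("A", list("GGGGGGGAAA"), min_coverage=4, min_variant_frequency=60))
# Two variants at 30% each: no SNP is called at 60%.
print(call_snp("A", list("GGGCCCAAAA"), min_coverage=4, min_variant_frequency=60))
```

With the threshold lowered to 30, the second example would report a variant, which mirrors the advice to lower the value for diploid genomes.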