Polonator Quality Data (one lane)
Data courtesy of Jeremy Edwards, UNM
Download Polonator Quality Data (Excel File)
Top Level: The Polonator base caller outputs a separate read file for each lane, with each read file containing ~ 20 million reads. Within every read, each base call is assigned a quality score. These quality scores range from 0 (highest quality) to 999 (lowest quality). An aligner then attempts to map each read back to the reference genome. The data file in this spreadsheet contains data on the reads from a single flow cell lane that successfully mapped back to the reference genome (in this case, Group A Streptococcus pyogenes). As reads are mapped back to the reference genome, base call mismatches occur, and have been tabulated in this spreadsheet, as described below:
- The total number of base call mismatches in this lane of data is shown in cell B2 (~ 7.7 million).
- The total number of base call matches is shown in cell C2 (~ 290 million).
- Column A is simply a list of all possible quality scores (0 to 999), where again, the lowest number corresponds to the highest quality.
- Column B is a list of the number of base calls for a given quality score that were a mismatch to the reference genome. While some of these mismatches are undoubtedly SNPs, we have assumed for the purpose of this analysis that all mismatches are errors.
- Column C is a list of the number of base calls for a given quality score that were a match to the reference genome.
- Column D is the error rate for each specific quality score, calculated as B / (B + C).
- Column E is the raw accuracy for each specific quality score (1 - D), expressed as a percentage.
- Column F is the mean of the individual accuracies (from Column E) for all quality scores from 0 up to any given quality score. This is useful in determining the cumulative accuracy for all quality scores below (better than) a given threshold.
- Column G is the cumulative percentage of all base calls, relative to the total number of calls, that have that quality score or a lower (better) one.
- For example, for the quality score 400, 91.52% of all base calls with this score were a match to the reference genome. For all calls with quality scores from 0 to 400, the mean accuracy was 98.29%. The percentage of all calls with quality scores of 400 or better (lower) to the total number of calls was 90.96%.
- This spreadsheet evaluates the effectiveness of the Polonator quality scores in predicting match / mismatch. We conclude that the Polonator quality scores provide an excellent indicator of base call matching.
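The column definitions above can be sketched in a few lines of Python. This is an illustrative reconstruction from the descriptions of Columns B through G, not the formulas stored in the spreadsheet itself; the function name `quality_table` and the handling of quality scores with no calls (skipped when averaging) are assumptions.

```python
def quality_table(mismatches, matches):
    """Rebuild Columns D-G from per-score mismatch (B) and match (C) counts.

    mismatches[q] and matches[q] are the counts for quality score q.
    Scores with no calls at all get None for D and E and are left out of
    the running mean in F (an assumption; the source does not say).
    """
    total_calls = sum(mismatches) + sum(matches)
    rows = []
    accuracy_sum = 0.0   # running sum of Column E values
    scores_seen = 0      # number of scores that contributed to the sum
    calls_so_far = 0     # running count of calls at this score or better
    for q, (b, c) in enumerate(zip(mismatches, matches)):
        n = b + c
        calls_so_far += n
        if n:
            error_rate = b / n                       # Column D = B / (B + C)
            accuracy = 100.0 * (1.0 - error_rate)    # Column E, in percent
            accuracy_sum += accuracy
            scores_seen += 1
        else:
            error_rate = accuracy = None
        # Column F: mean of the Column E values from score 0 up to q
        mean_accuracy = accuracy_sum / scores_seen if scores_seen else None
        # Column G: percent of all calls with score q or lower (better)
        cumulative_pct = 100.0 * calls_so_far / total_calls if total_calls else None
        rows.append((q, b, c, error_rate, accuracy, mean_accuracy, cumulative_pct))
    return rows
```

Note that Column F, as described, is an unweighted mean over quality scores. A mean weighted by the number of calls at each score would give a different figure, so the two should not be confused when choosing a threshold.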
Graphs: In the upper of the two graphs above, the numbers of base call matches and mismatches are plotted as a function of quality score. Matches are plotted in red against the left-hand vertical scale, while mismatches are plotted in blue against the right-hand scale. Note that the right-hand scale is forty times smaller than the left-hand scale. In the lower graph, the cumulative accuracy is plotted on the Y axis against the cumulative fraction of base calls with that quality score or better on the X axis.
Quality Score Details: The Polonator captures and processes four-color images of fluorescent beads in real time. Ideally, each bead will be dim in three of the four fluorescent colors and bright in one, allowing an accurate base call to be made. The Polonator Image Processing software presents the data for each image in the form of a tetrahedron plot. In these plots, each bead has a four-dimensional value corresponding to its normalized fluorescent intensity in each of the four colors, and high-quality beads cluster near one of the four vertices of the tetrahedron. The Polonator Image Processing software then assigns a quality score to each base call. These quality scores range from 0 to 999, and represent the distance, in the normalized four-dimensional fluorescence intensity space, from that base call to the centroid of all base calls of its called color. Accordingly, calls with the lowest quality scores lie closest to the centroid of a specific color and represent high quality. Calls with the largest quality scores lie far from any color centroid, indicate low quality, and likely reflect either non-clonal or marginally amplified beads.
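The distance-to-centroid idea can be illustrated with a short Python sketch. It is not the Polonator Image Processing code: the normalization (dividing each intensity by the bead's total), the use of Euclidean distance, the linear scaling onto 0 to 999, and the function names are all assumptions made for illustration.

```python
import math

def normalize(intensity):
    """Scale a four-color intensity so its components sum to 1 (assumed scheme)."""
    total = sum(intensity)
    return [v / total for v in intensity] if total else [0.25] * 4

def call_and_score(beads, max_score=999):
    """Call each bead's color and score it by distance to that color's centroid.

    beads: list of four-color intensity tuples, one per bead.
    Returns (calls, scores); a lower score means closer to the centroid.
    """
    points = [normalize(b) for b in beads]
    # The called color is the brightest of the four normalized channels.
    calls = [max(range(4), key=p.__getitem__) for p in points]
    # Centroid of all beads sharing the same called color.
    centroids = {}
    for color in set(calls):
        members = [p for p, c in zip(points, calls) if c == color]
        centroids[color] = [sum(axis) / len(members) for axis in zip(*members)]
    distances = [math.dist(p, centroids[c]) for p, c in zip(points, calls)]
    # Two normalized points are at most sqrt(2) apart; map that span onto
    # 0..max_score (an assumed scaling, chosen only to reproduce the range).
    scale = max_score / math.sqrt(2)
    scores = [min(max_score, round(d * scale)) for d in distances]
    return calls, scores
```

For example, `call_and_score([(10, 0, 0, 0), (10, 0, 0, 0), (6, 4, 0, 0)])` calls all three beads as the first color, and the mixed bead receives a higher (worse) score than the two pure ones because it sits farther from their shared centroid.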
Run details: The base calls in this run each reside within a 28 base (14 + 14) paired end read. The standard Polonator chemistry provides a 26 base (13 + 13) paired end read, but in this run the sequencing was extended to include an additional base on the minus (bead proximal) side of each tag. The Polonator features two flow cells, each with eight lanes that can be separately loaded with individual sample libraries. Each lane is loaded with ~180 million beads, of which about half both bind to the underside of the flow cell's glass cover slip and are recognized as individual entities by the Polonator Object Finder. Of the ~90 million beads per lane identified by the object finder, about 40% (~36 million) will have been amplified. Of these, about half, or 18 million, will pass the filter set within the base caller. Of these reads, around half will typically map back to the reference genome, resulting in about 9 million mappable reads per lane at present. Depending on library titration and technique, the typical range is from 7 to 12 million mappable reads per lane, or about 160 million reads per dual flow cell run. With the read length in this example of 28 bases, the run output is about 4.5 Gbp. Library beads can optionally be enriched to remove the unamplified beads; this effectively doubles the run output to ~9 Gbp, although the protocol is somewhat expensive and does not reduce the cost per base.
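The yield arithmetic in this paragraph can be checked with a short Python sketch. The fractions are the approximate ones quoted above; the function names are illustrative, and the default of 10 million mappable reads per lane is simply the figure implied by 160 million reads across 16 lanes.

```python
def lane_funnel(beads_loaded=180_000_000):
    """Apply the approximate per-lane yield fractions quoted in the text."""
    found = beads_loaded // 2          # about half are recognized by the object finder
    amplified = found * 40 // 100      # about 40% of those are amplified
    passed_filter = amplified // 2     # about half pass the base caller filter
    mappable = passed_filter // 2      # about half map back to the reference
    return {
        "found": found,
        "amplified": amplified,
        "passed_filter": passed_filter,
        "mappable": mappable,
    }

def run_output_bp(mappable_per_lane=10_000_000, lanes=2 * 8, read_length=28):
    """Total mapped bases for a dual flow cell run (two flow cells, eight lanes each)."""
    return mappable_per_lane * lanes * read_length

funnel = lane_funnel()
print(funnel["mappable"])           # mappable reads per lane
print(run_output_bp() / 1e9)        # run output in Gbp
```

With these inputs the funnel ends at 9 million mappable reads per lane, and the run output comes to 4.48 Gbp, which the text rounds to 4.5 Gbp.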
Alignment Software Details: The aligner used to map the data from this run back to its reference genome was written by a group at The University of New Mexico, led by Jeremy Edwards. The aligner, which accommodates the gap present in the current Polonator paired end tags, is being submitted for publication, and is not yet available for download. The aligner makes from one to four successive passes on each read in the read file generated by the Polonator basecaller.
- If the read maps back on any pass, it is scored as a mappable read, and no further passes are performed on that read.
- In the first pass, the 28 base reads are aligned against the reference genome, and all perfectly mappable reads are captured, generating no mismatches.
- If the read did not map back in pass one, the two bases within the read with the lowest quality scores are blanked, and the remaining 26 bases are aligned against the reference genome, capturing all perfect 26 base mappable reads. For each of these mappable reads, the two blanked bases are checked for a match, generating either one or two mismatches.
- If the read did not map back in pass two, the aligner searches for reads in which 27 of the 28 bases match the reference genome. All such mappable reads are captured, and each generates exactly one mismatch.
- If the read did not map back in pass three, the two bases within the read with the lowest quality scores are blanked, and the aligner searches for mappable reads in the remaining 25 out of 26 bases. This will generate either two or three mismatches.
- This spreadsheet tabulates the matches and mismatches for each of the 1000 quality scores.
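The four-pass logic described above can be sketched in Python as follows. This is not the UNM aligner, which is unpublished: the sketch scans a single reference string by brute force, ignores the gap in the paired end tags and the reverse strand, and returns the first hit it finds. The function names are hypothetical.

```python
def mismatch_positions(read, ref, start, skip=()):
    """Positions in the read (outside `skip`) that disagree with ref at `start`."""
    return [i for i, base in enumerate(read)
            if i not in skip and base != ref[start + i]]

def align(read, quals, ref):
    """Try the four passes in order; return (start, mismatches, pass) or None.

    quals[i] is the quality score of read[i]; a higher score means lower
    quality, so the two "lowest quality" bases are the two highest scores.
    """
    n = len(read)
    worst = set(sorted(range(n), key=lambda i: quals[i], reverse=True)[:2])
    # (pass number, blanked positions, mismatches tolerated in the rest)
    passes = [
        (1, set(), 0),   # all bases must match
        (2, worst, 0),   # two worst bases blanked, the rest must match
        (3, set(), 1),   # one mismatch tolerated anywhere
        (4, worst, 1),   # two worst blanked, one mismatch in the rest
    ]
    for number, blanked, allowed in passes:
        for start in range(len(ref) - n + 1):
            outside = mismatch_positions(read, ref, start, skip=blanked)
            if len(outside) <= allowed:
                # Blanked bases are checked against the reference only
                # after the rest of the read has been placed.
                inside = [i for i in sorted(blanked) if read[i] != ref[start + i]]
                return start, sorted(outside + inside), number
    return None  # the read did not map on any pass
```

Each returned mismatch position, paired with its quality score, would add one count to the mismatch column for that score, and every other base in the mapped read would add one count to the match column.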
Object and Base Filtering: The base calls tabulated in this spreadsheet are a subset of the total number of objects found in the scan that starts the sequencing run. For one thing, the basecaller analyzes run data on a lane by lane basis, and in most cases each lane will be loaded with a different genomic library. More importantly, a series of filters was used to eliminate beads or calls of low quality, as described below.
- During or after a sequencing run, the Polonator Image Processing GUI allows the operator to view histograms of the intensity distributions for any completed cycle. These histograms are typically bimodal, with the low intensity peak being associated with unamplified or marginally amplified beads, and the higher intensity peak being associated with properly amplified beads. The GUI allows the operator to set a fluorescent intensity cut-off for each lane, in order to cull unamplified or marginally amplified beads. With proper titration of template DNA during library construction, an optimal percentage of ~40% amplified beads can be achieved, or about 36 million amplified beads per lane.
- After ordering the cycles in 5’ to 3’ order, the Polonator basecaller allows the user to set a percentage of data to keep when creating the read file. This is typically set to about 50%. Setting the value higher will produce more reads, but these will include reads of lower quality. In this step, nonclonal and marginally amplified beads are culled.
- Finally, the aligner makes up to four passes on the entries in the read file, and only reads that map back to the reference genome are tabulated. The percentage of mappable reads output by the aligner is typically around 50 - 60% of the total reads included in the read file.