President's Corner - Whole Genome Sequencing Variant Files for Decision Support
About two years ago, I met with some leaders of a large Midwestern healthcare system. I gave them my standard pitch about the value of clinical whole genome sequencing (WGS). They responded with the usual “that’s nice” words, but then one bright woman asked, “but what will we do with the data?” Super question, which I will address here and in upcoming posts.
First, what data should we store? Not everyone agrees, but I think we should store the raw sequence data. Not because I think that the raw data will be used as input for clinical decision support, but rather because analysis and interpretation of the raw data will steadily improve over time. Raw data files are large and will cost non-trivial amounts to store, but with aggressive compression, storing them can and should be done.
But what data will be used as input for decision support? It has to be a clean, accurate list of variants, or patients will be harmed. Garbage in, garbage out. This is where the labs come in. The raw VCF files that are output from programs like the GATK variant analysis software are full of junk. This is intentional. To avoid missing true variants, we cast a wide net and catch a lot of false, artifactual variants. It must be the job of the labs to retain the true variants but remove the junk.
This is not an easy task. To an extent, this can be accomplished by careful computer processing of the sequence data. However, computer analysis alone is insufficient. Extensive validation of these results using orthogonal lab methods and carefully selected controls will need to be performed. Each clinically relevant paralogous gene, for example, needs to be examined individually to determine which variants are from the functional gene and which from pseudogenes.
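The computer-processing part of that cleanup can be illustrated with a small Python sketch. It is not our production pipeline: the function name `filter_vcf_lines` and the quality cutoff of 30 are assumptions chosen for illustration, and a real workflow would apply many more criteria before any orthogonal lab validation. The sketch relies only on the standard VCF column order, in which QUAL is the sixth column and FILTER is the seventh.

```python
def filter_vcf_lines(lines, min_qual=30.0):
    """Keep header lines and records that look trustworthy.

    min_qual is a hypothetical cutoff, not a recommended value.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header and meta-information lines are passed through unchanged.
            kept.append(line)
            continue
        fields = line.split("\t")
        qual_text, filter_text = fields[5], fields[6]
        if filter_text != "PASS":
            # The caller flagged this record (for example, LowQual).
            continue
        if qual_text == "." or float(qual_text) < min_qual:
            # Missing or low quality score: treat as unreliable.
            continue
        kept.append(line)
    return kept


# Made-up records, for illustration only.
example_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t12\tPASS\t.",
    "chr1\t300\t.\tG\tA\t80\tLowQual\t.",
    "chr1\t400\t.\tT\tC\t.\tPASS\t.",
]
clean_lines = filter_vcf_lines(example_lines)
```

A filter like this only removes the most obvious junk. Deciding whether a variant in a paralogous gene is real still requires the gene-by-gene validation described above.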
To effectively utilize WGS data in health care, sophisticated decision support software will need to be developed. These programs will require WGS variant data in a standard format as input. Variant Call Format (VCF) files are the current standard, but default VCFs lack information that is essential for clinical applications. For example, ACMG interpretations will need to be added to the variants, so (in most cases) only clinically relevant variants are presented to providers and patients. In addition, information needs to be appended specifying which segments of the genome were sequenced to sufficiently high quality to be used as input in decision support. It will take considerable work to set a standard file format and to include the essential information, but this is certainly an achievable task.
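As a rough sketch of what those two additions might look like, the Python below appends an ACMG classification and a high-quality-region flag to the INFO column of a VCF record. Everything specific here is an assumption for illustration: the INFO keys `ACMG_CLASS` and `CALLABLE` are hypothetical and not part of any existing standard, the classification lookup is a plain dictionary, and the high-quality regions are given BED-style (0-based start, exclusive end). A real file would also need matching `##INFO` header lines, and multi-allelic records would need extra handling.

```python
# Hypothetical set of classes that would be shown to providers.
REPORTABLE_CLASSES = {"Pathogenic", "Likely_pathogenic"}


def in_callable_region(chrom, pos, regions):
    """True if a 1-based VCF position falls in any BED-style region."""
    return any(c == chrom and start < pos <= end for c, start, end in regions)


def annotate_record(line, classifications, regions):
    """Append ACMG_CLASS and CALLABLE (both hypothetical keys) to INFO."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    acmg = classifications.get((chrom, pos, ref, alt), "Not_classified")
    callable_flag = "1" if in_callable_region(chrom, pos, regions) else "0"
    extra = f"ACMG_CLASS={acmg};CALLABLE={callable_flag}"
    # An INFO value of "." means the field is empty.
    fields[7] = extra if fields[7] == "." else f"{fields[7]};{extra}"
    return "\t".join(fields)


def is_reportable(annotated_line):
    """Report only clinically relevant variants from well-sequenced regions."""
    info_items = annotated_line.split("\t")[7].split(";")
    info = dict(item.split("=", 1) for item in info_items if "=" in item)
    return (info.get("ACMG_CLASS") in REPORTABLE_CLASSES
            and info.get("CALLABLE") == "1")


# Made-up data, for illustration only.
classifications = {("chr1", 1000, "C", "T"): "Pathogenic"}
regions = [("chr1", 900, 1100)]
inside = annotate_record("chr1\t1000\t.\tC\tT\t60\tPASS\tDP=35",
                         classifications, regions)
outside = annotate_record("chr1\t5000\t.\tG\tA\t60\tPASS\t.",
                          classifications, regions)
```

Keeping the annotations inside the INFO column means existing VCF tools can still read the file, which is one possible design choice among several that a standard would have to settle.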
It’s our goal at PreventionGenetics to produce version 1.0 of a clean WGS variant file by the end of 2021. This will be another important step in making population-level clinical WGS a reality.