ClinVar — a technical view

Ben Liesfeld
Published in Limbus News · 5 min read · Sep 19, 2017

ClinVar is one of the largest public repositories of genomic variant interpretation data. Although the ClinVar data is available for download, additional processing steps are necessary to make it useful for the interpretation of variants in other software tools.
Medical device manufacturers like Limbus must be extra cautious, since they are responsible for ensuring that importing the data into their platform does not alter the data from ClinVar. In our previous post we provided a short overview of ClinVar. It is now time for a more technical discussion of what kind and what quality of information ClinVar provides.

First of all, these are the ClinVar variant annotations that we provide in varvis:

  1. ClinVSig: the clinical significance as determined from all related ClinVar submissions in the current ClinVar release.
  2. ClinVarStatus: the ClinVar review status of a variant. This is shown as a star rating in the Variation Report.
  3. ClinVAcc: link with ID to access the ClinVar Variation Report on the ClinVar website, represented by the rsID from dbSNP or the internal ClinVar variation ID, if the rsID is not available.

Executive summary

varvis generally provides the following standardized values of clinical significance from ClinVar data:

Benign
Benign/Likely_benign
Likely_benign
Conflicting_interpretations_of_pathogenicity
Uncertain_significance
Likely_pathogenic
Pathogenic/Likely_pathogenic
Pathogenic

Note that Benign/Likely_benign is a separate, valid category in Variation Reports, whereas the same combination is treated as a conflict in RCVs. The same applies to Pathogenic/Likely_pathogenic.

If a specific variant has been reported as part of a compound variant, we indicate this with the prefix Compound: and display the classification of the compound variant. We thereby ensure that a variant that may be pathogenic in combination with another variant is not missed.

We follow the ClinVar recommendations and also display non-standardized terms such as risk_factor, association, drug_response, protective, other, not_provided, Affects.

Note that the review status of a Variation Report may not reflect the most recent state of clinical evidence. See this example, which shows a variant reviewed as benign by an expert panel for which new evidence was later submitted by GeneReviews as pathogenic. It appears that the status is not reset automatically when new clinical evidence emerges, and it may therefore be outdated.

For those of you who would like to dive a little deeper into the issue, here is a description of the most common technical pitfalls that we ran into:

Introduction: RCVs, SCVs, variation IDs

There is a lot of documentation on ClinVar, but this can also be a little overwhelming.

xkcd #1343 https://xkcd.com/1343/

ClinVar has accessions for submissions (e.g. SCV000012345) and for variant/phenotype relationships (e.g. RCV000012345). Additionally, ClinVar aggregates all data for a variant (or a set of variants, e.g. for compound-heterozygous variants) and associates this record with a so-called variation ID.

Since we want to annotate individual variants with the data, these IDs are what we are most interested in: IDs that can be used on the ClinVar website to access all information on the given variant.
For example, RCV000012345 belongs to the variation ID 11588, which is based on one submission, SCV000032579. This submission does not have a separate page, but you can see lists of submissions at the bottom of the RCV and variation report pages.
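To make the relationship between the three kinds of IDs concrete, here is a minimal sketch of the hierarchy in Python. The class and field names are our own illustration, not ClinVar's schema; only the three identifiers are taken from the example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Submission:
    """One submitter's assertion (hypothetical class, not ClinVar's schema)."""
    scv: str


@dataclass
class VariantPhenotypeRecord:
    """One variant/phenotype relationship, aggregating submissions."""
    rcv: str
    submissions: List[Submission] = field(default_factory=list)


@dataclass
class Variation:
    """Everything ClinVar knows about one variant or set of variants."""
    variation_id: int
    records: List[VariantPhenotypeRecord] = field(default_factory=list)

    def all_scvs(self) -> List[str]:
        # Flatten variation -> RCVs -> SCVs into a plain list of accessions.
        return [s.scv for r in self.records for s in r.submissions]


variation = Variation(
    variation_id=11588,
    records=[
        VariantPhenotypeRecord(
            rcv="RCV000012345",
            submissions=[Submission(scv="SCV000032579")],
        )
    ],
)
print(variation.all_scvs())
```

Annotating by variation ID then means walking this tree from the top, which is why the variation ID is the most useful handle for us.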

Issues with ClinVar data

In the following, we highlight a few examples of issues that we encountered when writing software to regularly import ClinVar data.

eXpensive Markup Language

ClinVar comes as a huge XML file, e.g. /ClinVarFullRelease_2017-08.xml.gz. According to the documentation, this ought to be "[t]he most detailed comprehensive file [...]", so we went with this 5 GB XML file and the 40-odd pages of a "data dictionary" documenting its structure.
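A file of this size should not be loaded into memory in one go. The following is a minimal sketch of a streaming read with Python's standard library, not our production importer. The record tag name ClinVarSet and the tiny sample document are assumptions made for illustration; check them against the release you actually download.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_records(xml_stream, record_tag="ClinVarSet"):
    """Yield one record element at a time from a (possibly huge) XML stream.

    record_tag is an assumption about the file layout, see the text above.
    """
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == record_tag:
            yield elem
            # Drop the children of the record we have just handed out,
            # so memory use stays roughly constant.
            elem.clear()


# Tiny in-memory stand-in for the real release file (structure is illustrative).
SAMPLE = b"""<ReleaseSet>
  <ClinVarSet ID="1"><Title>first record</Title></ClinVarSet>
  <ClinVarSet ID="2"><Title>second record</Title></ClinVarSet>
</ReleaseSet>"""

compressed = io.BytesIO(gzip.compress(SAMPLE))
with gzip.open(compressed, "rb") as stream:
    record_ids = [rec.get("ID") for rec in iter_records(stream)]
print(record_ids)
```

For the real file, the in-memory buffer would simply be replaced by the path of the downloaded .xml.gz file.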

Documentation

The data dictionary for the XML structure seems to be outdated (although the release date suggests otherwise). For example, the XML element MeasureTrait is mentioned over 30 times in the documentation but is nowhere to be found in the XML file itself.

Variant representation

ClinVar right-shifts variants, not only in their HGVS cDNA and protein change notation, but also in their genomic coordinates. Per HGVS recommendation, cDNA and protein changes should always be right-shifted, so this is also done in varvis. However, we keep the genomic coordinates left-shifted so that we can still annotate each variant as well as possible (via its minimal representation). Otherwise, those variants could not be found in other data sources such as gnomAD or dbSNP and would thus not be annotated correctly. Moreover, Yen et al. note that ClinVar does not right-shift properly in all cases, as it often seems to ignore intron/exon boundaries when doing so.
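To illustrate what left-shifting to a minimal representation means, here is a short sketch of the commonly used trim-and-extend procedure. It is an illustration under simplifying assumptions (the reference is a plain string, positions are 1-based, one alternative allele), not the varvis implementation; the function name left_align and the toy sequence are made up for this example.

```python
def left_align(reference, pos, ref, alt):
    """Left-shift and trim a variant.

    reference: the chromosome sequence as a string (toy assumption)
    pos:       1-based position of the first base of ref
    Returns the normalized (pos, ref, alt).
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both by one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat; the deletion of one "CA" unit is given
# in its right-most position (pos 6, ACA>A).
toy_reference = "GGCACACAT"
normalized = left_align(toy_reference, 6, "ACA", "A")
print(normalized)  # (2, 'GCA', 'G')
```

The right-shifted and the left-shifted form describe the same deletion within the repeat; only the left-shifted one matches how most other data sources store the variant.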

Variant uniqueness

After left-shifting and normalization, ClinVar’s variation IDs are not necessarily unique anymore. For example, chr1@76226858:G>GCTAGAATGAGTTA (hg19, 1-based coordinates) refers to both c.1012_1013insTAGAATGAGTTAC and c.999_1011dupTAGAATGAGTTAC. Each variation ID can in turn have multiple RCV records associated with it (and each of those, in turn, multiple SCV records). To resolve this issue, we sort variations by pathogenicity and review status, and then choose the first.
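The tie-breaking step can be sketched as follows. The exact ordering is an assumption made for this example (most pathogenic first, then the higher star rating), as are the dictionary layout and the made-up variation IDs; the significance terms are the standardized values listed in the executive summary.

```python
# Assumed ranking: most pathogenic first. The ordering is illustrative only.
PATHOGENICITY_ORDER = [
    "Pathogenic",
    "Pathogenic/Likely_pathogenic",
    "Likely_pathogenic",
    "Uncertain_significance",
    "Conflicting_interpretations_of_pathogenicity",
    "Likely_benign",
    "Benign/Likely_benign",
    "Benign",
]
RANK = {term: i for i, term in enumerate(PATHOGENICITY_ORDER)}


def pick_representative(candidates):
    """Choose one variation among several that share a genomic representation.

    Each candidate is a dict with the hypothetical keys
    'variation_id', 'significance' and 'stars' (review status as star count).
    """
    def sort_key(candidate):
        # Unknown terms sort last; more stars win among equal significance.
        rank = RANK.get(candidate["significance"], len(RANK))
        return (rank, -candidate["stars"])

    return sorted(candidates, key=sort_key)[0]


# Made-up candidates that collapse onto the same left-shifted variant.
candidates = [
    {"variation_id": 1001, "significance": "Uncertain_significance", "stars": 2},
    {"variation_id": 1002, "significance": "Likely_pathogenic", "stars": 1},
    {"variation_id": 1003, "significance": "Likely_pathogenic", "stars": 2},
]
chosen = pick_representative(candidates)
print(chosen["variation_id"])  # 1003
```

Sorting the more pathogenic classification to the front errs on the side of caution: a potentially relevant variant is shown rather than hidden behind a benign duplicate.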

Variation 404

We also found that not all variation IDs have valid "variation report" pages on the ClinVar website, although their RCV records do have pages, and these link to the missing reports. For example, RCV000022411 links to the page of the corresponding variation ID 1388, which does not exist.

Determination of clinical significance

Another issue of confusing documentation, but quite critical for variant annotation and thus deserving a special mention, is the determination of clinical significance.

It is well documented by the ClinVar team that the clinical significance displayed for RCVs differs from that of the variation reports, mostly because the latter use a more lenient definition of what counts as conflicting data. For example, the ACMG-recommended significances Benign and Likely benign would not be regarded as a conflict in a variation report, but they would be marked as a conflict in an RCV record.

This means that it is not enough to consider all RCV records per variation ID; all SCV records and their clinical significance need to be considered instead. The relevant data cannot be retrieved from the clinical significances associated with the RCVs.
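A simplified sketch of the lenient, variation-report style aggregation over SCV records might look like this. It only covers the rule described above and deliberately ignores review status and non-standardized terms, which a real import has to handle as well; the function name and the choice to return None when no standardized term is present are our own assumptions.

```python
STANDARD_TERMS = {
    "Benign",
    "Likely_benign",
    "Uncertain_significance",
    "Likely_pathogenic",
    "Pathogenic",
}


def aggregate_significance(scv_significances):
    """Combine the significances of all SCVs of one variation (simplified).

    Returns None if no standardized term is present (an assumption of this
    sketch, not ClinVar behaviour).
    """
    terms = {s for s in scv_significances if s in STANDARD_TERMS}
    if not terms:
        return None
    if len(terms) == 1:
        return next(iter(terms))
    # The lenient rule: these two pairs are merged instead of flagged.
    if terms == {"Benign", "Likely_benign"}:
        return "Benign/Likely_benign"
    if terms == {"Pathogenic", "Likely_pathogenic"}:
        return "Pathogenic/Likely_pathogenic"
    return "Conflicting_interpretations_of_pathogenicity"


merged = aggregate_significance(["Benign", "Likely_benign", "Benign"])
conflict = aggregate_significance(["Pathogenic", "Uncertain_significance"])
print(merged)
print(conflict)
```

Running the same inputs through an RCV-style rule would flag the first case as a conflict, which is exactly the information that is lost when only RCV-level significances are imported.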

About varvis and allexes

We address the issues of public variation data repositories with our data network allexes™ which is integrated into our varvis™ platform. In allexes, we collect complete variant and phenotype data in a standardized, de-identified and automated process across many organizations. allexes is scalable and provides transparency about data quality.

About varvis: the clinical decision support system varvis, designed by Limbus Medical Technologies GmbH, Rostock, Germany, is a cloud-based software system to filter and evaluate genetic sequencing data. The varvis platform significantly increases the efficiency and diagnostic yield in clinical genetics. varvis is a Class I medical device compliant with EU regulations. For more information about varvis, visit http://www.varvis.com.


Excited about the impact of genetic diagnostics on patients’ lives. Founder of a genomics software company.